IDA Framework
The IDA framework consists of six steps [Huebner et al 2018, Figure 1]. Here we assume that metadata (step I) exist in sufficient detail and that data cleaning (step II) has already been performed. Metadata summarize the background information about the data needed to conduct the IDA steps properly, and the data cleaning process identifies and corrects technical errors. Data screening (step III) examines data properties to inform decisions about the intended analysis. Initial data reporting (step IV) documents the insights from the previous steps and can be referred to when interpreting results from the regression modeling. A consequence of these analyses can be that the analysis plan needs to be refined or updated (step V). Finally, reporting of IDA results in research papers (step VI) is necessary to ensure transparency regarding key findings that influence the analysis or interpretation of results. Further details about the elements of IDA are discussed in [TG3 papers].
IDA framework
References
Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
Huebner M, Vach W, le Cessie S, Schmidt C, Lusa L. Hidden Analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Meth 2020; 20:61.
Scope of the regression analyses for the examples
Regression models can be used for a wide range of purposes. For these examples, the assumptions on the regression analysis set-up are listed in Table 1, so that IDA tasks are explained in a well-defined, practically relevant setting. Since a key principle is that IDA does not touch the research question, no associations between dependent (outcome) and independent (non-outcome) variables are considered.
Table 1: The scope of the regression analyses considered for IDA tasks
| Aspects of the research plan | Assumptions in this paper | Reason for the assumption |
|---|---|---|
| Dependent (outcome) variable | One dependent variable that can be continuous or binary; exclude time-to-event or longitudinal outcomes | Explain IDA tasks in a well-defined, practically relevant setting |
| Regression models | Models with linear predictors | Explain IDA tasks in a well-defined, practically relevant setting |
| Purpose of regression model | Adjust effect of one variable of interest for confounders; quantify the effects of explanatory variables on the outcome | Explain IDA tasks in a well-defined, practically relevant setting |
| Independent variables | “Explanatory” or “confounder” depending on purpose of model; small to moderate number of mixed types; not high dimensional; no repeated measurements | To demonstrate IDA approaches for a mix of variables likely to be encountered in practice |
| Statistical analysis plan | Exists, defines the outcome variable, the type of regression model to be used, and a set of independent variables | IDA does not touch the research question, but may lead to an update or refinement of the analysis plan |
References:
Vach W. Regression Models as a Tool in Medical Research. Chapman/Hall CRC 2012
Harrell FE. Regression Modeling Strategies. Springer (2nd ed) 2015
Royston P and Sauerbrei W. Multivariable Model Building. Wiley (2008)
[…]
Data screening and possible actions
Univariate distributions
| Variable type | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results |
| Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results |
| Continuous variables | Outliers | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part |
| Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present |
| Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted |
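Two of the SAP actions in the table above, winsorizing a continuous variable and collapsing rare categories, can be sketched in base R. This is only an illustration on made-up vectors: the helper names `winsorize` and `collapse_rare`, the 5%/95% limits, and the minimum count of 2 are assumptions, not choices made in this document.

```r
# Toy data standing in for real predictors (illustrative values only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 250)
f <- factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c", "d"))

# Winsorize: clamp values outside the chosen quantiles (hypothetical helper)
winsorize <- function(v, probs = c(0.05, 0.95)) {
  lim <- unname(quantile(v, probs = probs, na.rm = TRUE))
  pmin(pmax(v, lim[1]), lim[2])
}
x_w <- winsorize(x)

# Collapse categories seen fewer than `min_n` times into one level
# (hypothetical helper)
collapse_rare <- function(g, min_n = 2, other = "other") {
  counts <- table(g)
  rare <- names(counts)[counts < min_n]
  g <- as.character(g)
  g[g %in% rare] <- other
  factor(g)
}
f_c <- collapse_rare(f)

range(x_w)
table(f_c)
```

Whichever limits are chosen, the presentation of results should state that the model involves winsorized or collapsed variables, as the last column of the table indicates.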
Bivariate distributions
| Variable types | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous by continuous | Outliers (from the cloud) | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization |
| Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients | ||
| Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical by categorical | Frequent/rare combinations | interactions relevant? | Remove interaction from model | Fewer interactions to present |
Missing values
| Aspect | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | |
| Pattern | Variables missing independently or together | | Omit variables together | Changes model |
| Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on representative? | IPW needed? | Weighted analysis |
| Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation? |
References
Huebner M, le Cessie S, Schmidt CO, Vach W . A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192. Link
Harrell FE. Regression Modeling Strategies. Springer (2nd ed) 2015
[…]
CRASH-2
Introduction to CRASH-2
Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section Crash2_SAP.Rmd.
The hypothetical research aim for IDA is to develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.
A prediction model was developed and validated based on this data set in “Predicting early death in patients with traumatic bleeding”, Perel et al, BMJ 2012, [supplement available at]. The assumed research aim is in line with that prediction model.
CRASH-2 Description
Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) was a large randomised placebo-controlled trial among trauma patients with, or at risk of, significant haemorrhage, of the effects of antifibrinolytic treatment on death and transfusion requirement. The study is described at the original trial website. A public version of the data set is found at a repository of public data sets hosted by the Vanderbilt University’s Department of Biostatistics (Prof. Frank Harrell Jr.).
The data set includes 20,207 patients and 44 variables.
Note: In contrast to the analysis described in Perel et al, variables describing the economic region and the treatment allocation are missing in the public version of the data set, and while the data set contains 20,207 patients, the research paper mentions 20,127 patients having been included in the study.
Crash2 dataset contents
Source dataset
We refer to the source data set as the dataset available online here
Display the source dataset contents. This dataset is in the data-raw folder of the project directory.
Data frame: crash2
20207 observations and 44 variables, maximum # NAs: 17121
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| source | Method of Transmission of Entry Form to CC | | 5 | | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| outcomeid | Unique Number From Outcome Database | | | integer | integer | 80 |
| sex | | | 2 | | integer | 1 |
| age | | | | | integer | 4 |
| injurytime | Hours Since Injury | | | numeric | double | 11 |
| injurytype | | | 3 | | integer | 0 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| gcseye | Glasgow Coma Score Eye Opening | | | integer | integer | 732 |
| gcsmotor | Glasgow Coma Score Motor Response | | | integer | integer | 732 |
| gcsverbal | Glasgow Coma Score Verbal Response | | | integer | integer | 735 |
| gcs | Glasgow Coma Score Total | | | integer | integer | 23 |
| ddeath | Date of Death | | | Date | double | 17121 |
| cause | Main Cause of Death | | 7 | | integer | 17118 |
| scauseother | Description of Other Cause of Death | | 227 | | integer | 0 |
| status | Status of Patient at Outcome if Alive | | 3 | | integer | 3169 |
| ddischarge | Date of discharge, transfer to other hospital or day 28 from randomization | | | Date | double | 3185 |
| condition | Condition of Patient at Outcome if Alive | | 5 | | integer | 3251 |
| ndaysicu | Number of Days Spent in ICU | | | numeric | double | 182 |
| bheadinj | Significant Head Injury | | | integer | integer | 80 |
| bneuro | Neurosurgery Done | | | integer | integer | 80 |
| bchest | Chest Surgery Done | | | integer | integer | 80 |
| babdomen | Abdominal Surgery Done | | | integer | integer | 80 |
| bpelvis | Pelvis Surgery Done | | | integer | integer | 80 |
| bpe | Pulmonary Embolism | | | integer | integer | 80 |
| bdvt | Deep Vein Thrombosis | | | integer | integer | 80 |
| bstroke | Stroke | | | integer | integer | 80 |
| bbleed | Surgery for Bleeding | | | integer | integer | 80 |
| bmi | Myocardial Infarction | | | integer | integer | 80 |
| bgi | Gastrointestinal Bleeding | | | integer | integer | 80 |
| bloading | Complete Loading Dose of Trial Drug Given | | | integer | integer | 80 |
| bmaint | Complete Maintenance Dose of Trial Drug Given | | | integer | integer | 80 |
| btransf | Blood Products Transfusion | | | integer | integer | 80 |
| ncell | Number of Units of Red Call Products Transfused | | | numeric | double | 9963 |
| nplasma | Number of Units of Fresh Frozen Plasma Transfused | | | integer | integer | 9964 |
| nplatelets | Number of Units of Platelets Transfused | | | integer | integer | 9964 |
| ncryo | Number of Units of Cryoprecipitate Transfused | | | integer | integer | 9964 |
| bvii | Recombinant Factor VIIa Given | | | integer | integer | 374 |
| boxid | Treatment Box Number | | | integer | integer | 0 |
| packnum | Treatment Pack Number | | | integer | integer | 0 |
| Variable | Levels |
|---|---|
| source | telephone |
| | telephone entered manually |
| | electronic CRF by email |
| | paper CRF enteredd in electronic CRF |
| | electronic CRF |
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |
| cause | bleeding |
| | head injury |
| | myocardial infarction |
| | stroke |
| | pulmonary embolism |
| | multi organ failure |
| | other |
| scauseother | |
| | Acute Hypoxia |
| | ACUTE LUNG INJURY |
| | Acute Pulmonary Oedema |
| | Acute Renal Failure |
| | ACUTE RESPIRATORY DISTRESS SYNDROME (ARDS) |
| | acute respiratory failure |
| | acute respiratory failure+sepsis |
| | air amboli (embolism) |
| | Air embolism caused by penetrating lung trauma |
| | ... |
| status | discharged |
| | still in hospital |
| | transferred to other hospital |
| condition | no symptoms |
| | minor symptoms |
| | some restriction in lifestyle but independent |
| | dependent, but not requiring constant attention |
| | fully dependent, requiring attention day and night |
Updated analysis dataset
Additional metadata are added to the original source data set, and the modified data set is written back to the data folder. Metadata are added for the following variables:
- age - add label “Age” and unit “years”.
- injury time - add unit “hours”.
- total Glasgow coma score - add unit “points”.
At this stage we also select the variables of interest to take into the IDA phase by dropping the variables that are not checked in IDA.
As a cross check we display the contents again to ensure the additional data is added, and then write back the changes to the data folder in the file “data/a_crash2.rds”.
Input object size: 1221480 bytes; 12 variables 20207 observations
New object size: 1223272 bytes; 12 variables 20207 observations
Input object size: 1546808 bytes; 14 variables 20207 observations
New object size: 1385720 bytes; 14 variables 20207 observations
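The metadata update, variable selection, and write-back described above can be sketched in base R as follows. This is not the project's code: it uses a made-up four-row stand-in (`crash2_toy`), stores labels and units as plain `label` and `units` attributes, and writes to a temporary file rather than to “data/a_crash2.rds”. All object names here are hypothetical.

```r
# Toy stand-in for the source data (values are made up)
crash2_toy <- data.frame(
  entryid    = 1:4,
  age        = c(23L, 41L, 35L, NA),
  injurytime = c(1, 2.5, 0.5, 3),
  gcs        = c(15L, 9L, 14L, 3L),
  boxid      = c(101L, 102L, 103L, 104L)
)

# Attach label and unit metadata as plain attributes
attr(crash2_toy$age, "label")        <- "Age"
attr(crash2_toy$age, "units")        <- "years"
attr(crash2_toy$injurytime, "units") <- "hours"
attr(crash2_toy$gcs, "units")        <- "points"

# Keep only the variables that are examined in IDA
keep  <- c("entryid", "age", "injurytime", "gcs")
a_toy <- crash2_toy[, keep]

# Write the analysis data set and read it back as a cross-check
out_file <- file.path(tempdir(), "a_crash2_toy.rds")
saveRDS(a_toy, out_file)
a_check <- readRDS(out_file)

attributes(a_check$age)
```

Reading the file back and inspecting the attributes plays the same role as displaying the contents again: it confirms that the metadata survived the round trip.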
Data frame: a_crash2
20207 observations and 14 variables, maximum # NAs: 17121
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| ddeath | Date of Death | | | Date | double | 17121 |
| age | Age | years | | integer | integer | 4 |
| sex | Sex | | 2 | | integer | 1 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| gcs | Glasgow Coma Score Total | points | | integer | integer | 23 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| injurytime | Hours Since Injury | hours | | numeric | double | 11 |
| injurytype | Injury type | | 3 | | integer | 0 |
| time2death | | | | | integer | 17121 |
| earlydeath | Death within 28 days from injury | | | integer | integer | 0 |
| Variable | Levels |
|---|---|
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [9] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [13] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lubridate_1.7.9.2 png_0.1-7
## [4] assertthat_0.2.1 rprojroot_2.0.2 digest_0.6.27
## [7] R6_2.5.0 cellranger_1.1.0 backports_1.2.1
## [10] reprex_1.0.0 evaluate_0.14 httr_1.4.2
## [13] pillar_1.4.7 rlang_0.4.10 readxl_1.3.1
## [16] data.table_1.13.6 rstudioapi_0.13 rpart_4.1-15
## [19] Matrix_1.2-18 checkmate_2.0.0 rmarkdown_2.6
## [22] splines_4.0.2 foreign_0.8-80 htmlwidgets_1.5.3
## [25] munsell_0.5.0 broom_0.7.4 compiler_4.0.2
## [28] modelr_0.1.8 xfun_0.20 pkgconfig_2.0.3
## [31] base64enc_0.1-3 htmltools_0.5.1.1 nnet_7.3-14
## [34] tidyselect_1.1.0 htmlTable_2.1.0 gridExtra_2.3
## [37] bookdown_0.21 crayon_1.4.1 dbplyr_2.1.0
## [40] withr_2.4.1 grid_4.0.2 jsonlite_1.7.2
## [43] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.1
## [46] magrittr_2.0.1 scales_1.1.1 rmdformats_1.0.1
## [49] cli_2.3.0 stringi_1.5.3 fs_1.5.0
## [52] latticeExtra_0.6-29 xml2_1.3.2 ellipsis_0.3.1
## [55] generics_0.1.0 vctrs_0.3.6 RColorBrewer_1.1-2
## [58] tools_4.0.2 glue_1.4.2 hms_1.0.0
## [61] jpeg_0.1-8.1 yaml_2.2.1 colorspace_2.0-0
## [64] cluster_2.1.0 rvest_0.3.6 knitr_1.31
## [67] haven_2.3.1
Statistical analysis plan
Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan.
Hypothetical research aim for IDA: Develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.
The assumed analysis aim is in line with the prediction model presented by Perel et al, BMJ 2012, supplement available at.
Outcome variable
Early death, i.e. in-hospital death within 28 days from injury (binary variable)
Statistical methods
Logistic regression will be used to model early death by the following independent variables (measured at randomisation) deemed important to predict early death.
Demographic measurements:
- Age (age, years)
- Sex (sex, male or female)
Physiological measurements:
- Systolic blood pressure (sbp, mmHg)
- Heart rate (hr, 1/min)
- Respiratory rate (rr, 1/min)
- Glasgow coma score (gcs, points)
- Central capillary refill time (cc, seconds)
Characteristics of injury measurements:
- Time since injury (injurytime, hours)
- Type of injury (injurytype, ‘blunt’, ‘penetrating’ or ‘blunt and penetrating’)
Restricted cubic splines with 3 degrees of freedom and knots set to default values will be used for continuous variables. As the final prediction model should be parsimonious enough to simplify its application, a backward elimination algorithm with a significance level of \(\alpha=0.05\) will be applied to remove effects that are not statistically significant. Finally, the nonlinear representation of each continuous variable will be tested against a linear representation at \(\alpha=0.05\). If the nonlinear effect adds no value, the model will be refitted with a linear effect for that variable.
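The spline representation and the test of nonlinear against linear effects can be illustrated with a small simulation. The sketch below is an assumption-laden stand-in, not the planned analysis: it uses `splines::ns()` (natural cubic splines, whose default knot placement may differ from the defaults intended in the plan), simulated data with made-up coefficients, only two predictors, and no backward elimination step.

```r
library(splines)  # ships with R; ns() gives natural (restricted) cubic splines

set.seed(1)
n <- 500
toy <- data.frame(
  age = round(runif(n, 16, 90)),
  sbp = round(rnorm(n, 100, 25))
)
# Simulated outcome with a U-shaped sbp effect (illustrative only)
lp <- -2 + 0.02 * (toy$age - 35) + 0.0008 * (toy$sbp - 100)^2
toy$earlydeath <- rbinom(n, 1, plogis(lp))

# Logistic model with 3-df spline terms for both continuous predictors
fit_spline <- glm(earlydeath ~ ns(age, df = 3) + ns(sbp, df = 3),
                  family = binomial, data = toy)

# Same model, but sbp entered linearly (nested in fit_spline)
fit_linear_sbp <- glm(earlydeath ~ ns(age, df = 3) + sbp,
                      family = binomial, data = toy)

# Likelihood ratio test of the nonlinear part of sbp (2 df)
nonlin_test <- anova(fit_linear_sbp, fit_spline, test = "Chisq")
p_nonlinear <- nonlin_test[2, "Pr(>Chi)"]
p_nonlinear
```

Under the plan, a p-value above 0.05 for a variable would lead to refitting the model with a linear term for that variable.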
Remarks
Regarding type of injury, the original paper describes its treatment in the model as follows: ‘Type of injury had three categories – penetrating, blunt, or blunt and penetrating – but we analysed it as “penetrating” or “blunt and penetrating.”’ It is not clear from that description what happened to the ‘blunt’ group (we assume it was collapsed with ‘blunt and penetrating’). We consider all three categories here and then check our recommendations for the final analysis.
The original paper describes the modeling approach as follows: ‘We used a backward step-wise approach. Firstly, we included all potential prognostic factors and interaction terms that users considered plausible. These interactions included all potential predictors with type of injury, time since injury, and age. We then removed, one at a time, terms for which we found no strong evidence of an association, judged according to the P values (<0.05) from the Wald test.’ This would mean they tested at least 24 interaction terms, each possibly using several degrees of freedom! In the final model, only an interaction of Glasgow coma score and type of injury was included.
Preparations
The outcome variable, early death (i.e., death within 28 days from injury) must be computed from the time span between date of death and date of randomization using the following logic:
- transform ddeath and trandomised into an interpretable date format and then compute the difference
- interpret missing (i.e. NAs) as ‘not died within study period, at least not within 28 days’
- if patients died after 28 days, treat as alive
This can be derived using the following code logic:
## NOTE: This is for demonstration purposes; this code is not run here.
## The derivation was executed earlier.
a_crash2$time2death <-
  as.numeric(as.Date(a_crash2$ddeath) - as.Date(a_crash2$trandomised))
a_crash2$earlydeath[!is.na(a_crash2$time2death)] <-
  (a_crash2$time2death[!is.na(a_crash2$time2death)] <= 28) + 0
# +0 to transform it from TRUE/FALSE to 1/0
# NA in time2death means alive at day 28
a_crash2$earlydeath[is.na(a_crash2$time2death)] <- 0
We also display the marginal distribution of the derived outcome variable.
a_crash2 %>%
  dplyr::select(earlydeath) %>%
  gtsummary::tbl_summary()
| Characteristic | N = 20,207¹ |
|---|---|
| Death within 28 days from injury | 3,076 (15%) |
¹ n (%)
The number of deaths computed in the data set coincides with the number reported in Perel et al, BMJ 2012.
Sources
Data obtained from http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets
To download the data set, click the link to data set
Data dictionary
The data dictionary can be found LINK
References
CRASH-2 Collaborators. Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet 2010;376:23-32
Perel P, Prieto-Merino D, Shakur H, Clayton T, Lecky F, Bouamra O, Russell R, Faulkner M, Steyerberg EW, Roberts I. Predicting early death in patients with traumatic bleeding: development and validation of prognostic model. BMJ 2012; 345(aug15 1): e5166.
Missing data
Per variable missingness
Number and percentage of missing.
| Variable | Missing (count) | Missing (%) |
|---|---|---|
| cc | 611 | 3.02 |
| sbp | 320 | 1.58 |
| rr | 191 | 0.95 |
| hr | 137 | 0.68 |
| gcs | 23 | 0.11 |
| injurytime | 11 | 0.05 |
| age | 4 | 0.02 |
| sex | 1 | 0.00 |
| injurytype | 0 | 0.00 |
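A table of this kind can be produced with a few lines of base R. The sketch below uses a made-up five-row data frame, so the counts do not correspond to CRASH-2; the column names of `miss_tab` are illustrative choices.

```r
# Toy data with some missing values (illustrative only)
toy <- data.frame(
  cc  = c(2, NA, 3, NA, 4),
  sbp = c(100, 90, NA, 120, 110),
  sex = factor(c("male", "female", "male", "male", "female"))
)

# Count missing values per variable and express them as percentages
n_miss <- colSums(is.na(toy))
miss_tab <- data.frame(
  variable      = names(n_miss),
  missing_count = as.integer(n_miss),
  missing_pct   = unname(round(100 * n_miss / nrow(toy), 2)),
  row.names     = NULL
)

# Sort from most to least missing, as in the table above
miss_tab <- miss_tab[order(-miss_tab$missing_count), ]
miss_tab
```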
Missingness patterns over variables
(In)complete cases
This section presents patients with at least one missing value. First, we list these patients in a filterable table.
Then we report the pattern of missing for this set of patients.
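As a rough base-R sketch of both steps (the report itself may rely on dedicated packages for the table and the pattern plot), incomplete cases can be selected with `complete.cases()` and each patient's pattern encoded as the set of variables that are missing. The data and the `+` separator are made up for illustration.

```r
# Toy data with several missingness patterns (illustrative only)
toy <- data.frame(
  cc  = c(2, NA, 3, NA, 4, 5),
  sbp = c(100, 90, NA, NA, 110, 95),
  hr  = c(80, 85, 90, 100, 70, NA)
)

# Patients with at least one missing value
incomplete <- toy[!complete.cases(toy), ]

# Encode each patient's pattern as the names of the missing variables
pattern <- apply(is.na(incomplete), 1,
                 function(r) paste(names(r)[r], collapse = "+"))

# Frequency of each pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

Patterns in which several variables are missing together (here `cc+sbp`) are the ones relevant to the "variables missing independently or together" check.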
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.17 kableExtra_1.3.1 gt_0.2.2 naniar_0.6.0
## [5] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [9] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [13] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [17] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 lubridate_1.7.9.2 webshot_0.5.2
## [4] RColorBrewer_1.1-2 httr_1.4.2 rprojroot_2.0.2
## [7] UpSetR_1.4.0 tools_4.0.2 backports_1.2.1
## [10] R6_2.5.0 rpart_4.1-15 DBI_1.1.1
## [13] colorspace_2.0-0 nnet_7.3-14 withr_2.4.1
## [16] tidyselect_1.1.0 gridExtra_2.3 compiler_4.0.2
## [19] cli_2.3.0 rvest_0.3.6 htmlTable_2.1.0
## [22] xml2_1.3.2 labeling_0.4.2 bookdown_0.21
## [25] sass_0.3.1 scales_1.1.1 checkmate_2.0.0
## [28] commonmark_1.7 digest_0.6.27 foreign_0.8-80
## [31] rmarkdown_2.6 base64enc_0.1-3 jpeg_0.1-8.1
## [34] pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.1.0
## [37] highr_0.8 htmlwidgets_1.5.3 rlang_0.4.10
## [40] readxl_1.3.1 rstudioapi_0.13 generics_0.1.0
## [43] farver_2.0.3 jsonlite_1.7.2 crosstalk_1.1.1
## [46] magrittr_2.0.1 Matrix_1.2-18 Rcpp_1.0.6
## [49] munsell_0.5.0 lifecycle_0.2.0 visdat_0.5.3
## [52] stringi_1.5.3 yaml_2.2.1 plyr_1.8.6
## [55] grid_4.0.2 crayon_1.4.1 haven_2.3.1
## [58] splines_4.0.2 hms_1.0.0 knitr_1.31
## [61] pillar_1.4.7 reprex_1.0.0 glue_1.4.2
## [64] evaluate_0.14 latticeExtra_0.6-29 data.table_1.13.6
## [67] modelr_0.1.8 png_0.1-7 vctrs_0.3.6
## [70] rmdformats_1.0.1 cellranger_1.1.0 gtable_0.3.0
## [73] assertthat_0.2.1 xfun_0.20 broom_0.7.4
## [76] viridisLite_0.3.0 cluster_2.1.0 ellipsis_0.3.1
Univariate distribution checks
This section reports a series of univariate summary checks of the CRASH-2 dataset.
Data set overview
Using the Hmisc describe function, we provide an overview of the data set. The descriptive report also provides histograms of continuous variables. For ease of scanning the information, we group the report by measurement type.
Demographic variables
2 Variables 20207 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20203 | 4 | 84 | 0.999 | 34.56 | 15.55 | 18 | 19 | 24 | 30 | 43 | 55 | 64 |
sex: Sex
| n | missing | distinct |
|---|---|---|
| 20206 | 1 | 2 |
Value male female
Frequency 16935 3271
Proportion 0.838 0.162
Physiological measurements
5 Variables 20207 Observations
sbp: Systolic Blood Pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19887 | 320 | 173 | 0.989 | 98.45 | 27.86 | 60 | 70 | 80 | 95 | 110 | 130 | 143 |
hr: Heart Rate /min
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20070 | 137 | 173 | 0.996 | 104.5 | 23.38 | 70 | 80 | 90 | 105 | 120 | 130 | 140 |
rr: Respiratory Rate /min
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20016 | 191 | 68 | 0.99 | 23.06 | 7.052 | 14 | 16 | 20 | 22 | 26 | 30 | 35 |
gcs: Glasgow Coma Score Total points
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20184 | 23 | 13 | 0.863 | 12.47 | 3.594 | 4 | 6 | 11 | 15 | 15 | 15 | 15 |
Value 3 4 5 6 7 8 9 10 11 12 13 14
Frequency 784 520 441 584 733 576 504 663 586 951 1356 2140
Proportion 0.039 0.026 0.022 0.029 0.036 0.029 0.025 0.033 0.029 0.047 0.067 0.106
Value 15
Frequency 10346
Proportion 0.513
cc: Central Capillary Refille Time s
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19596 | 611 | 20 | 0.945 | 3.267 | 1.67 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
Value 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 1510 5328 6020 3367 1805 802 268 271 45 139 3 7
Proportion 0.077 0.272 0.307 0.172 0.092 0.041 0.014 0.014 0.002 0.007 0.000 0.000
Value 13 15 16 17 18 20 30 60
Frequency 3 19 3 1 1 2 1 1
Proportion 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
Characteristics of injury
2 Variables 20207 Observations
injurytime: Hours Since Injury hours
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20196 | 11 | 93 | 0.972 | 2.844 | 2.35 | 0.5 | 1.0 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
injurytype: Injury type
| n | missing | distinct |
|---|---|---|
| 20207 | 0 | 3 |
Value blunt penetrating blunt and penetrating
Frequency 11189 6552 2466
Proportion 0.554 0.324 0.122
Categorical variables
We now provide a closer visual examination of the categorical predictors.
Categorical ordinal plots
The Glasgow coma score, an ordinal categorical variable, is also displayed separately.
Continuous variables
A closer visual examination of continuous predictors.
There is evidence of digit preference, which should be explored further with targeted summaries. More detailed univariate summaries for the variables of interest are provided below.
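One simple targeted summary for digit preference is the distribution of the last digit of a measurement. The sketch below uses a short made-up vector of blood pressure readings, not the CRASH-2 values; without digit preference each last digit would be expected in roughly 10% of readings.

```r
# Made-up systolic blood pressure readings (illustrative only)
sbp <- c(100, 110, 90, 120, 95, 100, 80, 130, 112, 100, 90, 105)
sbp <- sbp[!is.na(sbp)]

# Tabulate the last digit, keeping digits that never occur
last_digit <- sbp %% 10
digit_tab  <- table(factor(last_digit, levels = 0:9))

# Share of readings ending in 0
prop_zero <- as.numeric(digit_tab["0"]) / length(sbp)

digit_tab
prop_zero
```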
Age
Distribution of subject age [years]
Five patients are younger than 17, the age referred to in the inclusion criteria for the study, including one patient aged 1.
Blood pressure
Distribution of SBP
Respiratory rate
Distribution of respiratory rate
Heart rate
Distribution of heart rate
Central capillary refill time
Distribution of Central capillary refill time
Hours since injury
Distribution of hours since injury
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [9] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [13] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lubridate_1.7.9.2 png_0.1-7
## [4] assertthat_0.2.1 rprojroot_2.0.2 digest_0.6.27
## [7] R6_2.5.0 cellranger_1.1.0 backports_1.2.1
## [10] reprex_1.0.0 evaluate_0.14 highr_0.8
## [13] httr_1.4.2 pillar_1.4.7 rlang_0.4.10
## [16] readxl_1.3.1 data.table_1.13.6 rstudioapi_0.13
## [19] rpart_4.1-15 Matrix_1.2-18 checkmate_2.0.0
## [22] rmarkdown_2.6 labeling_0.4.2 splines_4.0.2
## [25] foreign_0.8-80 htmlwidgets_1.5.3 munsell_0.5.0
## [28] broom_0.7.4 compiler_4.0.2 modelr_0.1.8
## [31] xfun_0.20 pkgconfig_2.0.3 base64enc_0.1-3
## [34] htmltools_0.5.1.1 nnet_7.3-14 tidyselect_1.1.0
## [37] htmlTable_2.1.0 gridExtra_2.3 bookdown_0.21
## [40] crayon_1.4.1 dbplyr_2.1.0 withr_2.4.1
## [43] grid_4.0.2 jsonlite_1.7.2 gtable_0.3.0
## [46] lifecycle_0.2.0 DBI_1.1.1 magrittr_2.0.1
## [49] scales_1.1.1 rmdformats_1.0.1 cli_2.3.0
## [52] stringi_1.5.3 farver_2.0.3 fs_1.5.0
## [55] latticeExtra_0.6-29 xml2_1.3.2 ellipsis_0.3.1
## [58] generics_0.1.0 vctrs_0.3.6 RColorBrewer_1.1-2
## [61] tools_4.0.2 glue_1.4.2 hms_1.0.0
## [64] jpeg_0.1-8.1 yaml_2.2.1 colorspace_2.0-0
## [67] cluster_2.1.0 rvest_0.3.6 knitr_1.31
## [70] haven_2.3.1 patchwork_1.1.1
Multivariate distributions
Overview
Variable correlation
corrs <- a_crash2 %>%
  dplyr::select(age, sex, sbp, hr, rr, cc, injurytime, injurytype) %>%
  dplyr::filter(complete.cases(.)) %>%
  # convert factors (sex, injurytype) to their integer codes
  dplyr::mutate_all(as.numeric)
M <- cor(corrs)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot::corrplot(M, method = "color", col = col(200),
  type = "upper", order = "hclust", number.cex = .7,
  addCoef.col = "black", # Add coefficient of correlation
  tl.col = "black", tl.srt = 90, # Text label color and rotation
  # hide correlation coefficient on the principal diagonal
  diag = FALSE)
Variable clustering
Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
Hmisc::varclus( ~ age + sbp + hr + rr + cc + gcs + injurytime + injurytype + sex, data = a_crash2)
## Hmisc::varclus(x = ~age + sbp + hr + rr + cc + gcs + injurytime +
## injurytype + sex, data = a_crash2)
##
##
## Similarity matrix (Spearman rho^2)
##
## age sbp hr rr cc gcs injurytime
## age 1.00 0.00 0.00 0.00 0.00 0.00 0.01
## sbp 0.00 1.00 0.11 0.03 0.07 0.01 0.01
## hr 0.00 0.11 1.00 0.05 0.02 0.02 0.00
## rr 0.00 0.03 0.05 1.00 0.02 0.00 0.00
## cc 0.00 0.07 0.02 0.02 1.00 0.02 0.00
## gcs 0.00 0.01 0.02 0.00 0.02 1.00 0.01
## injurytime 0.01 0.01 0.00 0.00 0.00 0.01 1.00
## injurytypepenetrating 0.02 0.00 0.01 0.00 0.00 0.06 0.05
## injurytypeblunt and penetrating 0.00 0.01 0.01 0.00 0.00 0.01 0.00
## sexfemale 0.01 0.00 0.00 0.00 0.00 0.00 0.00
## injurytypepenetrating
## age 0.02
## sbp 0.00
## hr 0.01
## rr 0.00
## cc 0.00
## gcs 0.06
## injurytime 0.05
## injurytypepenetrating 1.00
## injurytypeblunt and penetrating 0.07
## sexfemale 0.02
## injurytypeblunt and penetrating sexfemale
## age 0.00 0.01
## sbp 0.01 0.00
## hr 0.01 0.00
## rr 0.00 0.00
## cc 0.00 0.00
## gcs 0.01 0.00
## injurytime 0.00 0.00
## injurytypepenetrating 0.07 0.02
## injurytypeblunt and penetrating 1.00 0.00
## sexfemale 0.00 1.00
##
## No. of observations used for each pair:
##
## age sbp hr rr cc gcs injurytime
## age 20203 19884 20066 20012 19593 20180 20193
## sbp 19884 19887 19795 19750 19316 19883 19877
## hr 20066 19795 20070 19943 19482 20066 20059
## rr 20012 19750 19943 20016 19454 20014 20008
## cc 19593 19316 19482 19454 19596 19595 19588
## gcs 20180 19883 20066 20014 19595 20184 20173
## injurytime 20193 19877 20059 20008 19588 20173 20196
## injurytypepenetrating 20203 19887 20070 20016 19596 20184 20196
## injurytypeblunt and penetrating 20203 19887 20070 20016 19596 20184 20196
## sexfemale 20202 19886 20069 20015 19595 20183 20195
## injurytypepenetrating
## age 20203
## sbp 19887
## hr 20070
## rr 20016
## cc 19596
## gcs 20184
## injurytime 20196
## injurytypepenetrating 20207
## injurytypeblunt and penetrating 20207
## sexfemale 20206
## injurytypeblunt and penetrating sexfemale
## age 20203 20202
## sbp 19887 19886
## hr 20070 20069
## rr 20016 20015
## cc 19596 19595
## gcs 20184 20183
## injurytime 20196 20195
## injurytypepenetrating 20207 20206
## injurytypeblunt and penetrating 20207 20206
## sexfemale 20206 20206
##
## hclust results (method=complete)
##
##
## Call:
## hclust(d = as.dist(1 - x), method = method)
##
## Cluster method : complete
## Number of objects: 10
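The similarity measure and the clustering call shown in the output above can be mimicked in base R. The following is a minimal sketch on simulated data, not the Hmisc implementation: the toy variables `x1`, `x2` and `x3` are assumptions made for illustration, and the handling of categorical variables is left out.

```r
# Simulated toy data (illustration only): x2 is built from x1, x3 is unrelated
set.seed(1)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)

# Similarity matrix: squared Spearman rank correlation for each pair of variables
sim <- cor(toy, method = "spearman", use = "pairwise.complete.obs")^2

# Hierarchical clustering on the dissimilarity 1 - similarity,
# with complete linkage as reported in the output above
hc <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)
```

Calling `plot(hc)` would draw the dendrogram; in this toy example `x1` and `x2` are merged first because they are the most similar pair.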
Plot associations.
plot(Hmisc::varclus( ~ age + sbp + hr + rr + cc + gcs + injurytime + injurytype + sex, data = a_crash2))
Variable redundancy
Redundancy analysis of predictor variables.
Hmisc::redun( ~ hr + rr + age + sbp + injurytype + sex , data = a_crash2)
##
## Redundancy Analysis
##
## Hmisc::redun(formula = ~hr + rr + age + sbp + injurytype + sex,
## data = a_crash2)
##
## n: 19689 p: 6 nk: 3
##
## Number of NAs: 518
## Frequencies of Missing Values Due to Each Variable
## hr rr age sbp injurytype sex
## 137 191 4 320 0 1
##
##
## Transformation of target variables forced to be linear
##
## R-squared cutoff: 0.9 Type: ordinary
##
## R^2 with which each variable can be predicted from all other variables:
##
## hr rr age sbp injurytype sex
## 0.116 0.044 0.052 0.099 0.061 0.035
##
## No redundant variables
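The R^2 values reported above describe how well each variable can be predicted from all the others. The idea can be sketched with ordinary linear models in base R. This is a simplification under stated assumptions: `Hmisc::redun` can also transform the variables flexibly, which is omitted here, and the helper `r2_from_others` and the toy data are hypothetical.

```r
# For each variable, regress it on all remaining variables and return R^2
r2_from_others <- function(data) {
  vars <- names(data)
  vapply(vars, function(v) {
    fit <- lm(reformulate(setdiff(vars, v), response = v), data = data)
    summary(fit)$r.squared
  }, numeric(1))
}

# Simulated toy data (illustration only): x1 and x2 are nearly collinear
set.seed(1)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.3), x3 = rnorm(n))

r2 <- r2_from_others(toy)
round(r2, 3)

# Variables exceeding the R-squared cutoff of 0.9 shown in the output above
names(r2)[r2 > 0.9]
```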
Summary reports by sex
Overall
Baseline characteristics by sex.
| | N | male N=16935 | female N=3271 |
|---|---|---|---|
| Age (years) | 20203 | 23.0 30.0 41.0 33.7 ± 13.6 | 25.0 35.0 50.0 38.8 ± 16.8 |
| Systolic Blood Pressure (mmHg) | 19887 | 80.0 95.0 110.0 98.8 ± 25.5 | 80.0 90.0 110.0 96.7 ± 25.7 |
| Heart Rate (/min) | 20070 | 90.0 105.0 120.0 104.3 ± 21.2 | 92.0 106.0 120.0 105.2 ± 21.0 |
| Respiratory Rate (/min) | 20016 | 20.00 22.00 26.00 23.07 ± 6.77 | 20.00 22.00 26.00 23.03 ± 6.58 |
| Central Capillary Refill Time (s) | 19596 | 2.00 3.00 4.00 3.27 ± 1.72 | 2.00 3.00 4.00 3.23 ± 1.59 |
| Glasgow Coma Score Total (points) | 20184 | 11.00 15.00 15.00 12.44 ± 3.72 | 12.00 14.00 15.00 12.62 ± 3.46 |
| Hours Since Injury (hours) | 20196 | 1.00 2.00 4.00 2.85 ± 2.39 | 1.00 2.00 4.00 2.84 ± 2.67 |
| Injury type : blunt | 20207 | 0.53 8962/16935 | 0.68 2227/ 3271 |
| penetrating | | 0.35 5930/16935 | 0.19 621/ 3271 |
| blunt and penetrating | | 0.12 2043/16935 | 0.13 423/ 3271 |
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values.
Distribution of age by sex
Distribution of systolic blood pressure by sex
Distribution of heart rate by sex
Distribution of respiratory rate by sex
Distribution of central capillary refill time by sex
Distribution of hours since injury by sex
Distribution of Glasgow coma score (point scale) by sex
Distribution of injury type by sex
Summary reports by age
Age is categorized to explore the relationship between age and the other baseline variables. The categorization is purely exploratory and is not meant to influence the analysis strategy by encouraging a categorization (or dichotomization) of age in the model.
| Characteristic | N = 20,207 |
|---|---|
| age_C | |
| <30 | 9,070 (45%) |
| 30-44 | 6,477 (32%) |
| 45-59 | 3,204 (16%) |
| 60+ | 1,452 (7.2%) |
| NA | 4 (<0.1%) |
Values are n (%).
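The age categories tabulated above could be derived along the following lines. This is a minimal base R sketch, assuming half-open intervals [30, 45), [45, 60) and so on; the helper name `categorize_age` and the toy ages are hypothetical.

```r
# Categorize age into the four exploratory groups; missing ages stay NA
categorize_age <- function(age) {
  cut(age,
      breaks = c(-Inf, 30, 45, 60, Inf),
      labels = c("<30", "30-44", "45-59", "60+"),
      right = FALSE)  # intervals are closed on the left: [30, 45) etc.
}

age <- c(18, 29, 30, 44, 45, 59, 60, 85, NA)
age_C <- categorize_age(age)
table(age_C, useNA = "ifany")
```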
Report all variables by age category.
Baseline characteristics by age categories.
| | N | <30 N=9070 | 30-44 N=6477 | 45-59 N=3204 | 60+ N=1452 |
|---|---|---|---|---|---|
| Sex : female | 20202 | 0.13 1183/9070 | 0.15 959/6476 | 0.21 659/3204 | 0.32 469/1452 |
| Systolic Blood Pressure (mmHg) | 19884 | 80.0 96.0 110.0 98.1 ± 23.8 | 80.0 90.0 110.0 97.7 ± 25.3 | 80.0 94.0 112.0 100.1 ± 28.4 | 80.0 90.0 110.0 100.4 ± 30.2 |
| Heart Rate (/min) | 20066 | 91.0 106.0 120.0 105.3 ± 21.3 | 90.0 106.0 120.0 104.7 ± 20.9 | 90.0 104.0 120.0 103.3 ± 21.0 | 88.0 100.0 116.0 101.0 ± 21.8 |
| Respiratory Rate (/min) | 20012 | 20.00 22.00 26.00 22.93 ± 6.74 | 20.00 22.00 26.00 23.24 ± 6.68 | 20.00 22.00 26.00 23.11 ± 6.80 | 20.00 22.00 26.00 23.04 ± 6.89 |
| Central Capillary Refill Time (s) | 19593 | 2.00 3.00 4.00 3.20 ± 1.77 | 2.00 3.00 4.00 3.27 ± 1.65 | 2.00 3.00 4.00 3.34 ± 1.64 | 2.00 3.00 4.00 3.48 ± 1.56 |
| Glasgow Coma Score Total (points) | 20180 | 11.00 15.00 15.00 12.64 ± 3.61 | 11.00 14.50 15.00 12.39 ± 3.72 | 11.00 14.00 15.00 12.38 ± 3.70 | 10.00 14.00 15.00 12.00 ± 3.82 |
| Hours Since Injury (hours) | 20193 | 1.00 2.00 4.00 2.71 ± 2.18 | 1.00 2.00 4.00 2.83 ± 2.28 | 1.00 2.50 4.50 3.12 ± 3.17 | 1.00 3.00 4.50 3.12 ± 2.68 |
| Injury type : blunt | 20203 | 0.50 4544/9070 | 0.53 3462/6477 | 0.65 2081/3204 | 0.76 1101/1452 |
| penetrating | | 0.38 3448/9070 | 0.33 2155/6477 | 0.23 748/3204 | 0.14 199/1452 |
| blunt and penetrating | | 0.12 1078/9070 | 0.13 860/6477 | 0.12 375/3204 | 0.10 152/1452 |
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values.
Distribution of systolic blood pressure by age categories
Distribution of heart rate by age categories
Distribution of respiratory rate by age categories
Distribution of central capillary refill time by age categories
WIP: multivariate scatter plots
a_crash2 %>% dplyr::filter(!is.na(sbp)) %>% tally()
## n
## 1 19887
a_crash2 %>% dplyr::filter(is.na(sbp)) %>% tally()
## n
## 1 320
# Count subjects with both variables observed, and subjects missing at least one
bigN <- a_crash2 %>% dplyr::filter(!is.na(sbp) & !is.na(age)) %>% nrow()
n_miss <- a_crash2 %>% dplyr::filter(is.na(sbp) | is.na(age)) %>% nrow()
title <-
  paste0("Plot of ", Hmisc::label(a_crash2$age), " and ", Hmisc::label(a_crash2$sbp))
caption <-
  paste0(
    "n = ",
    bigN,
    " subjects displayed.\n",
    n_miss,
    " subjects with a missing value in at least one of the variables."
  )
# Axis labels combine each variable's label and unit
age_axis <- paste0(Hmisc::label(a_crash2$age), " [", Hmisc::units(a_crash2$age), "]")
sbp_axis <- paste0(Hmisc::label(a_crash2$sbp), " [", Hmisc::units(a_crash2$sbp), "]")
# Scatter plot of the complete cases: SBP on the x-axis, age on the y-axis
p1 <- a_crash2 %>%
  dplyr::filter(!is.na(sbp) & !is.na(age)) %>%
  dplyr::mutate(sbp = as.numeric(sbp),
                age = as.numeric(age)) %>%
  ggplot(aes(x = sbp, y = age)) +
  xlab(sbp_axis) +
  ylab(age_axis) +
  labs(
    title = title,
    caption = caption
  ) +
  geom_point(shape = 16, #size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()
p1
WIP: Scatter plots with a third or fourth variable
Scatter plot of age and RR by sex and injury type.
Scatter plot of SBP and RR by sex and injury type.
Summary reports by Glasgow coma score
Baseline characteristics by Glasgow coma score.
| | N | 3 N=784 | 4 N=520 | 5 N=441 | 6 N=584 | 7 N=733 | 8 N=576 | 9 N=504 | 10 N=663 | 11 N=586 | 12 N=951 | 13 N=1356 | 14 N=2140 | 15 N=10346 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age (years) | 20203 | 24.0 32.0 44.0 35.5 ± 14.9 | 25.0 33.0 44.0 35.5 ± 14.1 | 24.0 32.0 45.0 35.4 ± 14.7 | 23.0 31.0 45.0 35.4 ± 15.4 | 23.0 30.0 42.0 33.9 ± 14.0 | 24.0 32.0 45.0 35.7 ± 15.0 | 24.0 32.0 44.0 35.5 ± 14.6 | 24.0 31.0 42.0 34.4 ± 13.8 | 24.0 33.0 46.0 36.6 ± 15.6 | 25.0 32.0 45.0 35.9 ± 14.3 | 25.0 33.0 45.0 36.4 ± 15.0 | 24.0 31.0 44.0 35.1 ± 14.7 | 23.0 30.0 41.0 33.7 ± 13.8 |
| Heart Rate (/min) | 20070 | 90.0 112.0 128.0 106.9 ± 31.3 | 95.0 114.0 130.0 110.8 ± 29.2 | 98.0 110.5 130.0 111.4 ± 25.4 | 90.0 110.0 123.2 106.2 ± 24.4 | 95.0 109.0 120.0 107.1 ± 23.0 | 95.0 110.0 120.0 107.4 ± 24.0 | 92.0 109.0 120.0 105.5 ± 21.9 | 96.0 110.0 124.8 108.4 ± 24.0 | 96.0 110.0 122.0 107.8 ± 20.4 | 100.0 110.0 122.0 109.3 ± 20.2 | 96.0 108.0 120.0 106.5 ± 20.1 | 92.0 105.0 120.0 104.5 ± 19.9 | 90.0 100.0 115.0 102.0 ± 18.9 |
| Respiratory Rate (/min) | 20016 | 12.00 20.00 28.00 20.67 ± 10.74 | 16.00 22.00 28.00 22.22 ± 9.14 | 18.00 22.00 28.00 22.89 ± 8.69 | 18.00 21.00 26.00 22.12 ± 7.56 | 18.00 20.00 26.00 21.97 ± 7.69 | 18.00 22.00 28.00 23.11 ± 7.73 | 20.00 24.00 28.00 23.23 ± 6.99 | 19.00 22.00 28.00 23.05 ± 6.73 | 20.00 23.00 28.00 23.45 ± 6.37 | 20.00 24.00 28.00 24.32 ± 6.41 | 20.00 22.00 27.00 23.45 ± 6.53 | 20.00 22.00 26.00 23.41 ± 6.09 | 20.00 22.00 26.00 23.14 ± 6.07 |
| Systolic Blood Pressure (mmHg) | 19887 | 70.0 85.0 103.0 88.7 ± 33.7 | 78.0 90.0 116.0 96.5 ± 31.2 | 80.0 90.0 118.0 99.0 ± 30.7 | 80.0 100.0 127.0 104.3 ± 32.1 | 80.0 100.0 130.0 105.4 ± 30.6 | 80.0 90.0 115.0 99.2 ± 29.4 | 80.0 96.0 120.0 99.6 ± 28.9 | 80.0 90.0 110.0 92.6 ± 28.0 | 80.0 90.0 110.0 94.4 ± 26.4 | 71.0 90.0 100.0 88.4 ± 24.7 | 80.0 90.0 110.0 95.9 ± 23.5 | 80.0 90.0 110.0 96.4 ± 22.8 | 90.0 100.0 110.0 100.5 ± 23.1 |
| Central Capillary Refill Time (s) | 19596 | 3.00 4.00 5.00 4.15 ± 2.13 | 3.00 4.00 5.00 3.84 ± 1.90 | 2.00 3.00 5.00 3.76 ± 1.91 | 2.00 3.00 4.00 3.49 ± 1.64 | 2.00 3.00 4.00 3.28 ± 1.55 | 2.00 3.00 4.00 3.52 ± 1.69 | 2.00 3.00 4.00 3.40 ± 3.00 | 2.00 3.00 4.00 3.37 ± 1.66 | 2.00 3.00 4.00 3.27 ± 1.51 | 3.00 3.00 4.00 3.53 ± 1.60 | 2.00 3.00 4.00 3.40 ± 1.69 | 2.00 3.00 4.00 3.31 ± 1.73 | 2.00 3.00 4.00 3.06 ± 1.54 |
| Sex : female | 20206 | 0.14 107/ 784 | 0.13 68/ 520 | 0.12 53/ 441 | 0.16 92/ 584 | 0.14 100/ 733 | 0.15 89/ 576 | 0.15 74/ 504 | 0.19 124/ 663 | 0.17 97/ 586 | 0.21 198/ 951 | 0.20 270/ 1356 | 0.18 391/ 2139 | 0.16 1604/10346 |
| Hours Since Injury (hours) | 20196 | 1.00 2.00 4.00 2.54 ± 1.94 | 1.00 3.00 5.00 3.26 ± 2.20 | 1.00 3.00 5.00 3.42 ± 2.19 | 2.00 3.75 6.00 3.75 ± 2.31 | 2.00 3.00 5.00 3.62 ± 2.20 | 1.00 3.00 5.00 3.30 ± 2.17 | 1.00 3.00 5.00 3.12 ± 2.20 | 1.00 2.00 4.00 3.03 ± 2.19 | 1.00 2.50 4.00 3.01 ± 2.05 | 1.00 2.00 4.00 2.75 ± 2.03 | 1.00 2.00 4.00 2.79 ± 1.97 | 1.00 2.00 4.00 2.64 ± 1.99 | 1.00 2.00 4.00 2.71 ± 2.69 |
| Injury type : blunt | 20207 | 0.62 483/ 784 | 0.71 371/ 520 | 0.73 324/ 441 | 0.76 443/ 584 | 0.76 559/ 733 | 0.69 399/ 576 | 0.67 338/ 504 | 0.61 407/ 663 | 0.64 377/ 586 | 0.58 550/ 951 | 0.60 814/ 1356 | 0.58 1237/ 2140 | 0.47 4880/10346 |
| penetrating | | 0.22 175/ 784 | 0.10 53/ 520 | 0.09 41/ 441 | 0.10 59/ 584 | 0.11 77/ 733 | 0.15 89/ 576 | 0.17 88/ 504 | 0.23 151/ 663 | 0.21 123/ 586 | 0.29 272/ 951 | 0.24 326/ 1356 | 0.29 629/ 2140 | 0.43 4458/10346 |
| blunt and penetrating | | 0.16 126/ 784 | 0.18 96/ 520 | 0.17 76/ 441 | 0.14 82/ 584 | 0.13 97/ 733 | 0.15 88/ 576 | 0.15 78/ 504 | 0.16 105/ 663 | 0.15 86/ 586 | 0.14 129/ 951 | 0.16 216/ 1356 | 0.13 274/ 2140 | 0.10 1008/10346 |
a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents X ± 1 SD. N is the number of non-missing values.
Distribution of age by Glasgow coma score
Distribution of systolic blood pressure by Glasgow coma score
Distribution of heart rate by Glasgow coma score
Distribution of respiratory rate by Glasgow coma score
Distribution of central capillary refill time by Glasgow coma score
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.1.1 corrplot_0.84 gtsummary_1.3.6 Hmisc_4.4-2
## [5] Formula_1.2-4 survival_3.1-12 lattice_0.20-41 plotly_4.9.3
## [9] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [13] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [17] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] fs_1.5.0 usethis_2.0.1 lubridate_1.7.9.2
## [4] RColorBrewer_1.1-2 httr_1.4.2 rprojroot_2.0.2
## [7] tools_4.0.2 backports_1.2.1 R6_2.5.0
## [10] rpart_4.1-15 DBI_1.1.1 lazyeval_0.2.2
## [13] colorspace_2.0-0 nnet_7.3-14 withr_2.4.1
## [16] tidyselect_1.1.0 gridExtra_2.3 compiler_4.0.2
## [19] cli_2.3.0 rvest_0.3.6 gt_0.2.2
## [22] htmlTable_2.1.0 xml2_1.3.2 sass_0.3.1
## [25] labeling_0.4.2 bookdown_0.21 scales_1.1.1
## [28] checkmate_2.0.0 commonmark_1.7 digest_0.6.27
## [31] foreign_0.8-80 rmarkdown_2.6 base64enc_0.1-3
## [34] jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.5.1.1
## [37] dbplyr_2.1.0 highr_0.8 htmlwidgets_1.5.3
## [40] rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13
## [43] generics_0.1.0 farver_2.0.3 jsonlite_1.7.2
## [46] crosstalk_1.1.1 magrittr_2.0.1 Matrix_1.2-18
## [49] Rcpp_1.0.6 munsell_0.5.0 lifecycle_0.2.0
## [52] stringi_1.5.3 yaml_2.2.1 grid_4.0.2
## [55] crayon_1.4.1 haven_2.3.1 splines_4.0.2
## [58] hms_1.0.0 knitr_1.31 pillar_1.4.7
## [61] reprex_1.0.0 glue_1.4.2 evaluate_0.14
## [64] latticeExtra_0.6-29 data.table_1.13.6 broom.helpers_1.1.0
## [67] modelr_0.1.8 png_0.1-7 vctrs_0.3.6
## [70] rmdformats_1.0.1 cellranger_1.1.0 gtable_0.3.0
## [73] assertthat_0.2.1 xfun_0.20 broom_0.7.4
## [76] viridisLite_0.3.0 cluster_2.1.0 ellipsis_0.3.1
NHANES
Introduction to NHANES
Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section nhanes_SAP.Rmd.
The hypothetical research aim for IDA is to develop a multivariable model for MVPA (minutes of moderate/vigorous physical activity), with the primary aim of selecting variables to predict MVPA and the secondary aim of studying the role of systolic blood pressure in addition to the variables identified. MVPA can be analyzed on the original scale (untransformed data) to examine factors distinguishing very active participants, who spend large amounts of time on MVPA, from others, or on the logarithmic scale to distinguish participants according to percentage changes in MVPA, thus de-emphasizing extreme values.
NHANES Description
The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of non-institutionalized US civilians using a multistage probability sampling design that considers geographical area and minority representation. Sample weights are generated to create nationally representative estimates for the US population and subgroups defined by age, sex, and race/ethnicity. Link to CDC NHANES website. NHANES collects data on various health and behavior indicators, including physical activity and self‐reported diagnosis of prevalent health conditions such as diabetes mellitus, coronary artery disease, stroke, and cancer.
Physical activity was measured with a waist‐worn uniaxial accelerometer (AM‐7164; ActiGraph) for up to 7 days. Participants were asked to wear the device while awake, except when swimming or bathing. Data were cleaned according to calibration specifications, and non-wear time was defined by an interval of at least 60 consecutive minutes of zero activity intensity counts. Days with fewer than 10 hours of wear time were excluded, and participants with at least 1 valid day of accelerometer data were included in the analysis. Mean counts per minute were calculated by dividing the sum of activity counts for a valid day by the number of minutes of wear time in that day, across all valid days (Troiano 2008).
Moderate or vigorous intensity was based on count thresholds. Time spent in such activities was determined by summing the minutes in a day where the count met the criterion for that intensity (Troiano 2008).
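The processing rules described above can be sketched for a single day of minute-level counts. This is a simplified illustration in base R, not the NHANES or rnhanesdata processing code: the helper names `flag_nonwear` and `summarize_day` and the toy data are hypothetical, the MVPA cut-off of more than 2020 counts per minute is the one quoted in the statistical analysis plan section, and missing counts are assumed absent.

```r
# Flag non-wear minutes: runs of at least `min_run` consecutive zero counts
flag_nonwear <- function(counts, min_run = 60) {
  runs <- rle(counts == 0)
  is_nonwear_run <- runs$values & runs$lengths >= min_run
  rep(is_nonwear_run, times = runs$lengths)
}

# Summarize one day: wear time, validity, mean counts per minute, MVPA minutes
summarize_day <- function(counts, mvpa_cut = 2020, min_wear_min = 600) {
  wear <- !flag_nonwear(counts)
  wear_min <- sum(wear)
  list(
    wear_min  = wear_min,
    valid_day = wear_min >= min_wear_min,            # at least 10 hours of wear
    mean_cpm  = if (wear_min > 0) sum(counts[wear]) / wear_min else NA_real_,
    mvpa_min  = sum(counts[wear] > mvpa_cut)         # minutes above the cut-off
  )
}

# Toy day (illustration only): 8 hours of zeros, then 16 hours of mixed activity
counts <- c(rep(0, 480), rep(c(150, 0, 2500, 800), times = 240))
day <- summarize_day(counts)
str(day)
```

Isolated zero minutes during the active period are kept as wear time because they do not form a run of 60 consecutive zeros. Averaging `mean_cpm` over a participant's valid days would give the per-participant summary.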
The NHANES 2003–2004 and 2005–2006 have a total of 14,631 participants with accelerometry data. Participants aged 30 to 85 at the time they wore the accelerometer are included. Other inclusion criteria are in line with the choices for the prediction model of 5 year mortality presented by Smirnova et al, J Gerontol A Biol Sci Med Sci 2020. The preparation of the data was based on “Organizing and Analyzing the Activity Data in NHANES” Leroux et al, Statistics in Biosciences 2019. High quality processed activity data combined with mortality and demographic information can be downloaded and used in R with code from Andrew Leroux (https://andrew-leroux.github.io/rnhanesdata/articles).
Preparations
The R code provided with the rnhanesdata vignettes (https://andrew-leroux.github.io/rnhanesdata/articles) was modified to have fewer exclusion criteria, as noted below.
- Re-level comorbidities so that “refused/don’t know” responses are treated as not having the condition
- Re-level education to have 3 levels, with “don’t know/refused” set to missing
- Re-level alcohol consumption to include a level for missing values
- Remove the “bad” days from Act_Analysis and Act_Flags
- Compute systolic blood pressure as the mean of the non-missing values among the four blood pressure measurements
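The last preparation step can be illustrated in base R. The sketch below assumes the four readings are stored in the columns `bpxsy1` to `bpxsy4` listed in the source dataset contents; the helper name `mean_sys` and the toy values are hypothetical.

```r
# Mean of the non-missing systolic readings per participant;
# participants with no reading at all get NA rather than NaN
mean_sys <- function(data, cols = c("bpxsy1", "bpxsy2", "bpxsy3", "bpxsy4")) {
  m <- rowMeans(data[, cols, drop = FALSE], na.rm = TRUE)
  m[is.nan(m)] <- NA_real_
  unname(m)
}

# Toy data (illustration only)
toy <- data.frame(
  bpxsy1 = c(120, NA, NA),
  bpxsy2 = c(124, 130, NA),
  bpxsy3 = c(122, 134, NA),
  bpxsy4 = c(NA_real_, NA_real_, NA_real_)
)
toy$sys <- mean_sys(toy)
toy
```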
Following Smirnova et al, participants were excluded who
- had fewer than 3 days of data with at least 10 hours of estimated wear time or were deemed by NHANES to have poor quality data; non-wear periods were identified as intervals with at least 60 consecutive minutes of zero activity counts and at most 2 minutes with counts between 0 and 100;
- had missing mortality information or died from an accident;
- were alive with less than 1 year of follow-up.
The NHANES dataset used in this project contains 6680 participants. For the purposes of this IDA project, in contrast to Smirnova et al, we did not exclude participants who
- had missing body mass index (BMI) or education predictor variables;
- had missing systolic blood pressure, total cholesterol, or high-density lipoprotein (HDL) cholesterol measurements.
The final data set in Smirnova et al contained 2,978 participants.
Sources
Leroux A. Vignettes for downloading and working with NHANES 2003-2004 and 2005-2006 accelerometry data https://andrew-leroux.github.io/rnhanesdata/articles/
To download the analysis data set, click the link to data set —GITHUB
Data dictionary
The data dictionary can be found LINK —- GITHUB
References
Troiano RP, Berrigan D, Dodd KW, Mâsse LC, Tilert T, McDowell M. Physical activity in the United States measured by accelerometer. Med Sci Sports Exerc. 2008 Jan;40(1):181-8. doi: 10.1249/mss.0b013e31815a51b3. PMID: 18091006.
Leroux A, Di J, Smirnova E, Mcguffey E, Cao Q, Bayatmokhtari E, Tabacu L, Zipunnikov V, Urbanek JK, Crainiceanu C. Organizing and Analyzing the Activity Data in NHANES. Stat Biosci 11, 262–287 (2019). https://doi.org/10.1007/s12561-018-09229-9
Smirnova E, Leroux A, Tabacu L, Zipunnikov V, Crainiceanu C, Urbanek JK. The Predictive Performance of Objective Measures of Physical Activity Derived From Accelerometry Data for 5-Year All-Cause Mortality in Older Adults: National Health and Nutritional Examination Survey 2003–2006, The Journals of Gerontology: Series A, Volume 75, Issue 9, September 2020, Pages 1779–1785, https://doi.org/10.1093/gerona/glz193
NHANES dataset contents
Source dataset
We refer to the source data set as the dataset available online here
Display the source dataset contents. This dataset is in the data-raw folder of the project directory.
Data frame:nhanesdat
6680 observations and 58 variables, maximum # NAs:5529
| Name | Levels | Storage | NAs |
|---|---|---|---|
| seqn | | integer | 0 |
| paxcal | | integer | 0 |
| paxstat | | integer | 0 |
| weekday | | integer | 0 |
| sddsrvyr | | double | 0 |
| eligstat | | integer | 0 |
| mortstat | | integer | 9 |
| permth.exm | | integer | 9 |
| sdmvpsu | | double | 0 |
| sdmvstra | | double | 0 |
| wtint2yr | | double | 0 |
| wtmec2yr | | double | 0 |
| ridagemn | | double | 0 |
| ridageex | | double | 0 |
| ridageyr | | double | 0 |
| bmi | | double | 56 |
| bmi.cat | 4 | integer | 56 |
| race | 5 | integer | 0 |
| gender | 2 | integer | 0 |
| diabetes | 2 | integer | 0 |
| chf | 2 | integer | 0 |
| chd | 2 | integer | 0 |
| cancer | 2 | integer | 0 |
| stroke | 2 | integer | 0 |
| educationadult | 3 | integer | 7 |
| mobilityproblem | 2 | integer | 0 |
| drinkstatus | 4 | integer | 0 |
| drinksperweek | | double | 466 |
| smokecigs | 3 | integer | 4 |
| bpxsy1 | | double | 972 |
| bpxsy2 | | double | 1224 |
| bpxsy3 | | double | 1296 |
| bpxsy4 | | double | 5529 |
| lbxtc | | double | 270 |
| lbdhdd | | double | 270 |
| age | | double | 0 |
| sys | | double | 320 |
| tac | | double | 708 |
| tlac | | double | 708 |
| wt | | double | 708 |
| st | | double | 708 |
| mvpa | | double | 708 |
| about | | double | 708 |
| sbout | | double | 708 |
| satp | | double | 708 |
| astp | | double | 708 |
| tlac.1 | | double | 708 |
| tlac.2 | | double | 708 |
| tlac.3 | | double | 708 |
| tlac.4 | | double | 708 |
| tlac.5 | | double | 708 |
| tlac.6 | | double | 708 |
| tlac.7 | | double | 708 |
| tlac.8 | | double | 708 |
| tlac.9 | | double | 708 |
| tlac.10 | | double | 708 |
| tlac.11 | | double | 708 |
| tlac.12 | | double | 708 |
| Variable | Levels |
|---|---|
| bmi.cat | Normal |
| | Underweight |
| | Overweight |
| | Obese |
| race | White |
| | Mexican American |
| | Other Hispanic |
| | Black |
| | Other |
| gender | Male |
| | Female |
| diabetes, chf, chd, cancer, stroke | No |
| | Yes |
| educationadult | Less than high school |
| | High school |
| | More than high school |
| mobilityproblem | No Difficulty |
| | Any Difficulty |
| drinkstatus | Moderate Drinker |
| | Non-Drinker |
| | Heavy Drinker |
| | Missing alcohol |
| smokecigs | Never |
| | Former |
| | Current |
Updated analysis dataset
Additional metadata are added to the original source data set, and the modified data set is written back to the data folder. Labels and units are added for the following variables:
- seqn - add label “respondent sequence number”
- gender - add label “gender”
- age - add label “age” and unit “years”
- educationadult - add label “education level”
- permth.exm - add label “Person Months of Follow-up from MEC/Exam Date”
- mortstat - add label “Final mortality status”
- sys - add label “Systolic Blood pressure” and unit “mg/dl”
- lbxtc - add label “Total cholesterol” and unit “mg/dL”
- lbdhdd - add label “HDL cholesterol” and unit “mg/dL”
- smokecigs - add label “smoking status”
- drinkstatus - add label “alcohol consumption”
- bmi - add label “body mass index” and unit “kg/m2”
- diabetes - add label “diabetes”
- chf - add label “congestive heart failure”
- cancer - add label “cancer”
- stroke - add label “stroke”
- mobilityproblem - add label “difficulties with mobility”
- tac - add label “total activity counts per day”
- tlac - add label “total log activity count (log(1+activity))”
- wt - add label “total accelerometer wear time” and unit “minutes”
- mvpa - add label “Moderate or vigorous physical activity” and unit “minutes”
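Labels and units of this kind can be attached with the Hmisc package, which the session info shows is loaded. The following is a minimal sketch on a toy data frame, not the project's actual code; the object `toy` and its values are hypothetical, and only two of the variables listed above are shown.

```r
library(Hmisc)

# Toy data frame (illustration only)
toy <- data.frame(age = c(34, 51, 67), bmi = c(22.1, 27.8, 31.4))

# Attach a label and a unit of measurement to each variable
label(toy$age) <- "age"
units(toy$age) <- "years"
label(toy$bmi) <- "body mass index"
units(toy$bmi) <- "kg/m2"

# Display the metadata as a cross-check
contents(toy)
```

The labels and units can then be retrieved with `Hmisc::label()` and `Hmisc::units()`, for example to build axis titles.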
At this stage we select the variables of interest to take into the IDA phase by dropping the variables we do not check in IDA.
As a cross-check, we display the contents again to confirm that the metadata were added, and then write the changes back to the data folder in the file “data/a_nhanes.rda”.
Input object size: 1479216 bytes; 33 variables 6680 observations New object size: 1416624 bytes; 33 variables 6680 observations
Data frame:a_nhanes
6680 observations and 33 variables, maximum # NAs:708
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| seqn | respondent sequence number | | | integer | integer | 0 |
| age | age | years | | numeric | double | 0 |
| gender | gender | | 2 | | integer | 0 |
| permth.exm | Person Months of Follow-up from MEC/Exam Date | | | integer | integer | 9 |
| mortstat | Final mortality status | | | integer | integer | 9 |
| educationadult | education level | | 3 | | integer | 7 |
| smokecigs | smoking status | | 3 | | integer | 4 |
| drinkstatus | alcohol consumption | | 4 | | integer | 0 |
| bmi | body mass index | kg/m2 | | numeric | double | 56 |
| diabetes | diabetes | | 2 | | integer | 0 |
| chf | congestive heart failure | | 2 | | integer | 0 |
| cancer | cancer | | 2 | | integer | 0 |
| stroke | stroke | | 2 | | integer | 0 |
| sys | Systolic blood pressure | mg/dl | | integer | integer | 320 |
| lbxtc | Total cholesterol | mg/dL | | integer | integer | 270 |
| lbdhdd | HDL cholesterol | mg/dL | | integer | integer | 270 |
| mobilityproblem | difficulties with mobility | | 2 | | integer | 0 |
| tac | total activity counts per day | | | numeric | double | 708 |
| tlac | total log activity count (log(1+activity)) | | | numeric | double | 708 |
| mvpa | Moderate or vigorous physical activity | minutes | | numeric | double | 708 |
| wt | total accelerometer wear time | minutes | | numeric | double | 708 |
| tlac.1 | total log activity count 12:00AM-2:00AM | | | numeric | double | 708 |
| tlac.2 | total log activity count 2:00AM-4:00AM | | | numeric | double | 708 |
| tlac.3 | total log activity count 4:00AM-6:00AM | | | numeric | double | 708 |
| tlac.4 | total log activity count 6:00AM-8:00AM | | | numeric | double | 708 |
| tlac.5 | total log activity count 8:00AM-10:00AM | | | numeric | double | 708 |
| tlac.6 | total log activity count 10:00AM-12:00PM | | | numeric | double | 708 |
| tlac.7 | total log activity count 12:00PM-2:00PM | | | numeric | double | 708 |
| tlac.8 | total log activity count 2:00PM-4:00PM | | | numeric | double | 708 |
| tlac.9 | total log activity count 4:00PM-6:00PM | | | numeric | double | 708 |
| tlac.10 | total log activity count 6:00PM-8:00PM | | | numeric | double | 708 |
| tlac.11 | total log activity count 8:00PM-10:00PM | | | numeric | double | 708 |
| tlac.12 | total log activity count 10:00PM-12:00AM | | | numeric | double | 708 |
| Variable | Levels |
|---|---|
| gender | Male |
| | Female |
| educationadult | Less than high school |
| | High school |
| | More than high school |
| smokecigs | Never |
| | Former |
| | Current |
| drinkstatus | Moderate Drinker |
| | Non-Drinker |
| | Heavy Drinker |
| | Missing alcohol |
| diabetes, chf, cancer, stroke | No |
| | Yes |
| mobilityproblem | No Difficulty |
| | Any Difficulty |
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [9] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [13] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lubridate_1.7.9.2 png_0.1-7
## [4] assertthat_0.2.1 rprojroot_2.0.2 digest_0.6.27
## [7] R6_2.5.0 cellranger_1.1.0 backports_1.2.1
## [10] reprex_1.0.0 evaluate_0.14 httr_1.4.2
## [13] pillar_1.4.7 rlang_0.4.10 readxl_1.3.1
## [16] data.table_1.13.6 rstudioapi_0.13 rpart_4.1-15
## [19] Matrix_1.2-18 checkmate_2.0.0 rmarkdown_2.6
## [22] splines_4.0.2 foreign_0.8-80 htmlwidgets_1.5.3
## [25] munsell_0.5.0 broom_0.7.4 compiler_4.0.2
## [28] modelr_0.1.8 xfun_0.20 pkgconfig_2.0.3
## [31] base64enc_0.1-3 htmltools_0.5.1.1 nnet_7.3-14
## [34] tidyselect_1.1.0 htmlTable_2.1.0 gridExtra_2.3
## [37] bookdown_0.21 crayon_1.4.1 dbplyr_2.1.0
## [40] withr_2.4.1 grid_4.0.2 jsonlite_1.7.2
## [43] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.1
## [46] magrittr_2.0.1 scales_1.1.1 rmdformats_1.0.1
## [49] cli_2.3.0 stringi_1.5.3 fs_1.5.0
## [52] latticeExtra_0.6-29 xml2_1.3.2 ellipsis_0.3.1
## [55] generics_0.1.0 vctrs_0.3.6 RColorBrewer_1.1-2
## [58] tools_4.0.2 glue_1.4.2 hms_1.0.0
## [61] jpeg_0.1-8.1 yaml_2.2.1 colorspace_2.0-0
## [64] cluster_2.1.0 rvest_0.3.6 knitr_1.31
## [67] haven_2.3.1
Statistical analysis plan
Since a key principle of IDA is not to touch the research questions, the research aim and statistical analysis plan need to be in place before IDA commences. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan.
Hypothetical research aim for IDA: The primary aim is to develop a multivariable model for MVPA (minutes of moderate/vigorous physical activity) by selecting variables to predict MVPA. Specifically, the role of gender and age will be investigated. A secondary aim is to study the role of systolic blood pressure in addition to the variables identified. MVPA can be analyzed on the original scale (untransformed data) to examine factors distinguishing very active participants, who spend large amounts of time on MVPA, from others, or on the logarithmic scale to distinguish participants according to percentage changes in MVPA, thus de-emphasizing extreme values.
The inclusion criteria are in line with the choices for the prediction model of 5 year mortality presented by Smirnova et al, J Gerontol A Biol Sci Med Sci 2020.
Statistical methods
Linear regression models will be used to model MVPA. Explanatory variables are age, gender, education, smoking, alcohol consumption, BMI, comorbidities (cancer, CHF, stroke), cholesterol (total, HDL). Partial R-squared will be used to identify an appropriate set of variables to predict MVPA. A secondary aim is to study the role of systolic blood pressure on MVPA in a linear regression model with variables identified in the previous step.
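The plan does not state which definition of partial R-squared is meant. One common definition is the proportional reduction in residual sum of squares when a variable is added to a model that already contains the others. The sketch below illustrates that definition in base R on simulated data; the helper `partial_r2`, the toy data and the effect sizes are assumptions made for illustration only.

```r
# Partial R^2 of the term(s) dropped from `full` to obtain `reduced`:
# (SSE_reduced - SSE_full) / SSE_reduced
partial_r2 <- function(full, reduced) {
  sse_full <- sum(residuals(full)^2)
  sse_reduced <- sum(residuals(reduced)^2)
  (sse_reduced - sse_full) / sse_reduced
}

# Simulated toy data (illustration only): mvpa depends on age but not on bmi
set.seed(42)
n <- 500
toy <- data.frame(age = runif(n, 30, 85), bmi = rnorm(n, 28, 5))
toy$mvpa <- 60 - 0.6 * toy$age + rnorm(n, sd = 8)

full <- lm(mvpa ~ age + bmi, data = toy)
pr2_age <- partial_r2(full, update(full, . ~ . - age))
pr2_bmi <- partial_r2(full, update(full, . ~ . - bmi))
round(c(age = pr2_age, bmi = pr2_bmi), 3)
```

Ranking candidate variables by such values is one way to compare their contribution to the prediction of MVPA; other definitions, such as partial sum of squares divided by the total sum of squares, would give different numbers.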
Variables
Outcome variable
MVPA (total minutes of moderate/vigorous physical activity which is defined as more than 2020 counts per minute) (mvpa, minutes)
Sociodemographic variables
- age at examination (i.e. when participants wore the device) (age, years)
- gender (gender, “Male” and “Female”)
- race/ethnicity (non-Hispanic “White”, non-Hispanic “Black”, “Mexican American”, and “Other”)
- education (“Less than high school”, “High school” (high school graduate/general educational development [GED]), “More than high school” (some college, and college graduate)) (educationadult)
- 5 year mortality, NAs for individuals with follow up less than 5 years and alive (yr5.mort)
- Person Months of Follow-up from MEC/Exam Date (permth.exm) (follow-up time in this cohort in years = permth.exm/12)
- final mortality status (mortstat, 0, 1, NAs for individuals with follow up less than 5 years and alive)
Health and behavior variables
- smoking status (Current, Former [those reporting quitting within the previous 6 months], and Never) (smokecigs)
- alcohol consumption (drinkstatus) (Non-Drinker, Moderate Drinker, Heavy Drinker, Missing alcohol)
- bmi (bmi, kg/m2)
- obesity (bmi.cat, No-Yes)
- diabetes (diabetes)
- congestive heart failure (chf, No-Yes)
- cancer (cancer, No-Yes)
- stroke (stroke, No-Yes)
- average systolic blood pressure using the 4 measurements per participant (sys, mmHg)
- Total cholesterol (lbxtc, mg/dL)
- HDL cholesterol (lbdhdd, mg/dL)
- difficulties with mobility (mobilityproblem, “No Difficulty”, “Any Difficulty” = a positive response to difficulty walking a quarter-mile, difficulty climbing 10 stairs, or use of any special equipment to walk)
Physical activity data
Summary measures are calculated due to the large size of minute level accelerometer-derived physical activity data.
- total activity counts per day (TAC/d)
- total log activity count (TLAC, log(1+activity))
- total minutes of moderate/vigorous physical activity (MVPA)
- total accelerometer wear time (WT)
- total log activity count summary measures (tlac.1, tlac.2, …, tlac.12) in each 2-hr window, i.e. 12AM-2AM, 2AM-4AM, 4AM-6AM, etc.
Initial data analysis strategy
1. Statistical analysis plan: as assumed above, the analysis strategy to answer the main research question has been prespecified. It comprises the set of independent variables to be considered in a model, the outcome variable, and the analytical strategy to build the regression model.
The SAP is listed above.
2. Data dictionary and metadata: a detailed data dictionary should be available that describes the meaning of each variable in the context of the research question, the units of measurement, and the possible levels (for categorical variables) or admissible values. More generally, metadata also refer to information about the research study protocol and data collection processes.
A data dictionary is available.
3. Domain expertise and pivotal covariates (‘very important predictors’, VIPs)
It has been shown that physical activity declines with age and that men report higher levels of activity than women. Age and gender, also defined in the research aim, are pivotal covariates.
Keadle S et al. Prevalence and trends in physical activity among older adults in the United States: A comparison across three national surveys. Prev Med. 2016 Aug; 89: 37–43. https://doi.org/10.1016/j.ypmed.2016.05.009
Clarke TC, Norris T, and Schiller JS. Early Release of Selected Estimates Based on Data From the 2018 National Health Interview Survey. https://www.cdc.gov/nchs/nhis/releases/released201905.htm#7a
3.2. Domain expertise may also be useful to specify in advance which variables are expected to correlate with each other. This background knowledge could be summarized in a directed or undirected acyclic graph connecting the covariates with each other as also suggested by Heinze et al, 2018.
3.3. Missing value mechanisms: if not already specified in the metadata, domain experts should also be consulted to explain possible reasons for the occurrence of missing values for each variable, which may be categorized as systematic or unsystematic.
Missing values are expected due to the nature of survey research. Domain expertise would be helpful in identifying specific expectations. Missingness of some covariates may be associated with the outcome variable. This will be considered in the IDA domain ‘Missing values’ to identify approaches or updates to the SAP.
IDA domain: missing values
1. Number and proportion of missing values for each independent variable, for the dependent variable and for the analysis as a whole.
Number and proportion of missing values will be computed for all variables.
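The per-variable counts and proportions could be obtained along the following lines. This is a minimal base R sketch on an invented toy data frame; the object names `miss_tab` and `n_incomplete` are illustrative only.

```r
# Toy data frame with some missing values (values are invented).
dat <- data.frame(
  age  = c(56, 52, 63, 83, 37),
  bmi  = c(31.3, NA, 19.6, 28.3, NA),
  mvpa = c(48.3, 9.4, NA, 3.0, 58.8)
)

# Missing count and percentage per variable.
n_miss   <- colSums(is.na(dat))
miss_tab <- data.frame(
  variable    = names(n_miss),
  missing_n   = as.integer(n_miss),
  missing_pct = round(100 * n_miss / nrow(dat), 2)
)
miss_tab <- miss_tab[order(-miss_tab$missing_n), ]  # most incomplete first

# Records with at least one missing value (the analysis as a whole).
n_incomplete <- sum(!complete.cases(dat))

miss_tab
n_incomplete
```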
2. Patterns of missing values across all independent variables, either as tables or appropriately visualized.
We will create missing value indicators for each covariate and will then summarize patterns by means of a heat map and a dendrogram.
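One way to build the indicators and summarize the patterns is sketched below in base R. The data and the missingness mechanism are simulated, and clustering the indicators with a binary distance is an assumption about how the dendrogram might be constructed, not the method used in this report.

```r
set.seed(2)
n   <- 200
dat <- data.frame(
  sys   = rnorm(n, 127, 20),
  lbxtc = rnorm(n, 204, 40),
  bmi   = rnorm(n, 29, 6)
)
dat$lbdhdd <- dat$lbxtc / 4

# Invented missingness: the two cholesterol values go missing together.
lab_missing <- sample(n, 20)
dat$lbxtc[lab_missing]  <- NA
dat$lbdhdd[lab_missing] <- NA
dat$sys[sample(n, 15)]  <- NA
dat$bmi[sample(n, 5)]   <- NA

# Missing-value indicators (1 = missing) for each covariate.
ind <- 1 * is.na(dat)

# Frequency of each distinct missingness pattern across records.
patterns <- table(apply(ind, 1, paste, collapse = ""))
sort(patterns, decreasing = TRUE)

# Cluster variables by how often they are missing together.
hc <- hclust(dist(t(ind), method = "binary"))

# Plots are drawn only in an interactive session.
if (interactive()) {
  plot(hc, main = "Clustering of missingness indicators")
  heatmap(ind, Rowv = NA, scale = "none", labRow = NA)
}
```

Variables that are always missing together, such as the two cholesterol measures here, merge at height zero in the dendrogram.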
3. Patterns of missing values associated with the outcome variable
This may need to change.
From Lee et al (STRATOS)
A table of the observed characteristics for the “complete” versus “incomplete” (or all) participants, or by whether variables with substantial missingness are observed.
An assessment of the predictors of missingness, e.g. using a logistic regression model fitted to an indicator for being a complete record, and predictors of missing values i.e. associations with the incomplete variables.
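The logistic regression for predictors of missingness might look like the following minimal sketch. The data and the missingness mechanism are simulated for illustration only, and the set of predictors is an assumption.

```r
set.seed(3)
n   <- 1000
dat <- data.frame(
  age    = runif(n, 30, 85),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  mvpa   = rexp(n, 1 / 19)
)

# Invented mechanism: younger participants are more likely to lack mvpa.
p_miss <- plogis(-1 - 0.04 * (dat$age - 55))
dat$mvpa[runif(n) < p_miss] <- NA

# Indicator for being a complete record, modelled on observed covariates.
dat$complete <- as.integer(complete.cases(dat))
fit <- glm(complete ~ age + gender, data = dat, family = binomial)
round(summary(fit)$coefficients, 3)
```

A clearly non-zero coefficient indicates that completeness depends on that covariate, which argues against assuming that values are missing completely at random.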
IDA domain: univariate distributions
3. For categorical variables (including the dependent variable): frequency and proportion for each category.
Demographic variables, smoking status, alcohol consumption, comorbidities, and mortality status will be described by frequencies and proportions.
4. For continuous variables (including the dependent variable): high-resolution histogram, summary of main percentiles (1st, 10th, 25th, 50th, 75th, 90th, 99th) and interquartile range, 5 highest and 5 lowest values, first four moments (mean, variance, skewness, kurtosis), standard deviation, number of distinct values.
Summaries for all continuous variables (age, BMI, physiological variables, physical activity) will be created to depict their marginal distributions by means of high-resolution histograms. Furthermore, each continuous variable will be described by 1st, 10th, 25th, 50th, 75th, 90th, 99th percentiles, interquartile range, the 5 highest and 5 lowest values, the first four moments (mean, variance, skewness, kurtosis), standard deviation, and the number of distinct values.
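The numerical summaries listed above could be collected by a small helper such as the one sketched here. The function `describe_continuous()` is hypothetical, and the skewness and kurtosis use simple moment estimators, which is only one of several conventions.

```r
# Hypothetical helper returning the listed summaries for one variable.
describe_continuous <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / s
  list(
    percentiles = quantile(x, c(0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99)),
    iqr         = IQR(x),
    lowest5     = head(sort(x), 5),
    highest5    = tail(sort(x), 5),
    mean        = m,
    variance    = var(x),
    sd          = s,
    skewness    = mean(z^3),  # 0 for a symmetric distribution
    kurtosis    = mean(z^4),  # about 3 for a normal distribution
    n_distinct  = length(unique(x))
  )
}

set.seed(4)
x <- c(rexp(300, 1 / 19), NA)  # right-skewed toy variable with one NA
str(describe_continuous(x))
```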
While the outcome variable MVPA is the only physical activity variable in the analysis plan, it is derived from some of the other physical activity variables. These will therefore also be examined in the univariate step to identify potential issues of skewness or unusual values.
The graphical summary for each variable will serve to suggest transformations for each variable:
- no transformation (in case of approximate symmetry);
- \(\log_{10}(x+1)\) transformation (in case of skewness)
The distributions of the transformed variables will also be evaluated as described above.
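The effect of the proposed transformation on skewness can be checked numerically. The sketch below uses an invented right-skewed variable and a simple moment-based skewness function, `skew()`, defined here for illustration.

```r
set.seed(5)
x <- rexp(1000, 1 / 19)  # invented right-skewed variable, zeros allowed

# Simple moment-based skewness.
skew <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

x_log <- log10(x + 1)    # the +1 keeps zero values defined
c(raw = skew(x), log10_plus1 = skew(x_log))
```

A skewness much closer to zero after transformation supports using the transformed scale for that variable.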
It is assumed that the data have been cleaned, but unusual values will be identified and possibly excluded.
IDA domain: multivariate system of variables
5. Matrix/heatmap of Pearson correlation coefficients between all independent variables.
Pearson correlation coefficients will be computed between all independent variables. The correlation coefficients will be depicted by means of a (square) heat map. Moreover, a network graph between all independent variables will be constructed, which will be thresholded at an absolute correlation coefficient of 0.3.
Spearman correlation coefficients will be computed as well, and the 10 pairs of covariates with the largest absolute difference between Pearson and Spearman correlation coefficients will be flagged. These pairs will be graphically investigated by constructing separate scatterplots.
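The comparison of Pearson and Spearman coefficients, and the thresholding at 0.3, can be sketched in base R as follows. The data are simulated, and the object names `pair_tab`, `flagged` and `edges` are illustrative.

```r
set.seed(6)
n   <- 300
dat <- data.frame(age = runif(n, 30, 85), bmi = rnorm(n, 29, 6))
dat$sys    <- 90 + 0.6 * dat$age + rnorm(n, sd = 15)
dat$lbxtc  <- exp(rnorm(n, 5.3, 0.2))
dat$lbdhdd <- exp(4.8 - 0.03 * dat$bmi + rnorm(n, sd = 0.3))  # nonlinear in bmi

pear  <- cor(dat, method = "pearson",  use = "pairwise.complete.obs")
spear <- cor(dat, method = "spearman", use = "pairwise.complete.obs")

# One row per unordered pair of variables.
idx      <- which(upper.tri(pear), arr.ind = TRUE)
pair_tab <- data.frame(
  var1     = rownames(pear)[idx[, "row"]],
  var2     = colnames(pear)[idx[, "col"]],
  pearson  = pear[idx],
  spearman = spear[idx]
)
pair_tab$abs_diff <- abs(pair_tab$pearson - pair_tab$spearman)

# Pairs with the largest disagreement; the toy data have exactly 10 pairs.
flagged <- head(pair_tab[order(-pair_tab$abs_diff), ], 10)
flagged

# Edges of a network graph thresholded at an absolute correlation of 0.3.
edges <- pair_tab[abs(pair_tab$pearson) >= 0.3, c("var1", "var2", "pearson")]
edges
```

Large differences between the two coefficients usually point to nonlinear but monotone relationships or to outlying points, which is why the flagged pairs are then inspected in scatterplots.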
6. Appropriate visual (and numerical) presentations of the association of each covariate with the two pivotal covariates.
A redundancy analysis will be conducted for each variable. This analysis identifies predictors that can be almost perfectly predicted by flexible parametric additive models performed on the companion covariates.
Categorical and continuous variables will be summarized with counts and proportions or medians and quartiles, as appropriate, in a table stratified by sex and age groups.
Scatterplots of continuous variables by age will be constructed stratified by gender.
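The idea behind the redundancy analysis can be illustrated by predicting each covariate from all the others and inspecting the resulting R-squared. The sketch below is a simplification: it uses ordinary linear models rather than flexible parametric additive models, the data are simulated, `ldl_like` is an invented variable, and the 0.9 threshold is an assumption.

```r
set.seed(7)
n   <- 400
dat <- data.frame(age = runif(n, 30, 85), lbxtc = rnorm(n, 204, 40))
dat$lbdhdd   <- rnorm(n, 55, 15)
# Invented variable that is almost determined by the two cholesterol values.
dat$ldl_like <- dat$lbxtc - dat$lbdhdd + rnorm(n, sd = 2)

# R-squared from predicting each covariate from all the others.
r2 <- sapply(names(dat), function(v) {
  fit <- lm(reformulate(setdiff(names(dat), v), response = v), data = dat)
  summary(fit)$r.squared
})
round(r2, 3)

# Candidates for redundancy under the assumed threshold.
names(r2)[r2 > 0.9]
```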
7. If interactions between covariates were prespecified to be included in the regression model, special attention should be given to evaluate the bivariate distribution of the interacting covariates. Appropriate graphical displays (see 6) should be used to visualise these distributions.
Interactions between age and gender will be considered. The distribution of age will be depicted as a histogram stratified by gender.
8. For a derived outcome variable, the bivariate distributions of the variables from which it is derived with the outcome variable should be evaluated with appropriate visualizations.
Scatter plots of the physical activity variables with MVPA will be constructed with trend lines.
Missing data
Per variable missingness
Number and percentage of missing.
| Variable | Missing (count) | Missing (%) |
|---|---|---|
| tac | 708 | 10.60 |
| tlac | 708 | 10.60 |
| mvpa | 708 | 10.60 |
| wt | 708 | 10.60 |
| tlac.1 | 708 | 10.60 |
| tlac.2 | 708 | 10.60 |
| tlac.3 | 708 | 10.60 |
| tlac.4 | 708 | 10.60 |
| tlac.5 | 708 | 10.60 |
| tlac.6 | 708 | 10.60 |
| tlac.7 | 708 | 10.60 |
| tlac.8 | 708 | 10.60 |
| tlac.9 | 708 | 10.60 |
| tlac.10 | 708 | 10.60 |
| tlac.11 | 708 | 10.60 |
| tlac.12 | 708 | 10.60 |
| sys | 320 | 4.79 |
| lbxtc | 270 | 4.04 |
| lbdhdd | 270 | 4.04 |
| bmi | 56 | 0.84 |
| permth.exm | 9 | 0.13 |
| mortstat | 9 | 0.13 |
| educationadult | 7 | 0.10 |
| smokecigs | 4 | 0.06 |
| age | 0 | 0.00 |
| gender | 0 | 0.00 |
| drinkstatus | 0 | 0.00 |
| diabetes | 0 | 0.00 |
| chf | 0 | 0.00 |
| cancer | 0 | 0.00 |
| stroke | 0 | 0.00 |
| mobilityproblem | 0 | 0.00 |
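As a quick arithmetic check, the percentages in the table can be reproduced from the counts and the 6680 rows of the data set. The vector below copies a subset of the counts from the table.

```r
n_total   <- 6680  # rows in the data set
missing_n <- c(mvpa = 708, sys = 320, lbxtc = 270, bmi = 56,
               mortstat = 9, educationadult = 7, smokecigs = 4)

missing_pct <- round(100 * missing_n / n_total, 2)
missing_pct

# Participants with physical activity data available.
n_total - missing_n[["mvpa"]]
```

The last line gives 5972, the number of participants with accelerometer-derived measurements.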
Variable summaries for complete vs incomplete cases
| | Physical activity data missing (N=708) | Physical activity data observed (N=5972) | Total (N=6680) | p value |
|---|---|---|---|---|
| age | | | | < 0.001 |
| Median | 48.375 | 53.750 | 53.167 | |
| Q1, Q3 | 38.583, 64.271 | 41.646, 67.250 | 41.333, 67.000 | |
| Range | 30.000 - 84.917 | 30.000 - 84.917 | 30.000 - 84.917 | |
| gender | | | | 0.432 |
| Male | 359 (50.7%) | 2935 (49.1%) | 3294 (49.3%) | |
| Female | 349 (49.3%) | 3037 (50.9%) | 3386 (50.7%) | |
| education level | | | | 0.072 |
| N-Miss | 3 | 4 | 7 | |
| Less than high school | 216 (30.6%) | 1683 (28.2%) | 1899 (28.5%) | |
| High school | 186 (26.4%) | 1448 (24.3%) | 1634 (24.5%) | |
| More than high school | 303 (43.0%) | 2837 (47.5%) | 3140 (47.1%) | |
| body mass index | | | | 0.796 |
| Median | 28.400 | 28.080 | 28.100 | |
| Q1, Q3 | 24.373, 32.353 | 24.730, 32.230 | 24.718, 32.250 | |
| Range | 16.570 - 72.280 | 13.360 - 130.210 | 13.360 - 130.210 | |
| smoking status | | | | 0.051 |
| N-Miss | 2 | 2 | 4 | |
| Never | 342 (48.4%) | 2911 (48.8%) | 3253 (48.7%) | |
| Former | 185 (26.2%) | 1759 (29.5%) | 1944 (29.1%) | |
| Current | 179 (25.4%) | 1300 (21.8%) | 1479 (22.2%) | |
| alcohol consumption | | | | 0.008 |
| Moderate Drinker | 359 (50.7%) | 3090 (51.7%) | 3449 (51.6%) | |
| Non-Drinker | 238 (33.6%) | 2098 (35.1%) | 2336 (35.0%) | |
| Heavy Drinker | 40 (5.6%) | 389 (6.5%) | 429 (6.4%) | |
| Missing alcohol | 71 (10.0%) | 395 (6.6%) | 466 (7.0%) | |
| Final mortality status | | | | 0.205 |
| Median | 0.000 | 0.000 | 0.000 | |
| Q1, Q3 | 0.000, 0.000 | 0.000, 0.000 | 0.000, 0.000 | |
| Range | 0.000 - 1.000 | 0.000 - 1.000 | 0.000 - 1.000 | |
| diabetes | | | | 0.659 |
| No | 614 (86.7%) | 5214 (87.3%) | 5828 (87.2%) | |
| Yes | 94 (13.3%) | 758 (12.7%) | 852 (12.8%) | |
| congestive heart failure | | | | 0.538 |
| No | 677 (95.6%) | 5739 (96.1%) | 6416 (96.0%) | |
| Yes | 31 (4.4%) | 233 (3.9%) | 264 (4.0%) | |
| cancer | | | | 0.106 |
| No | 649 (91.7%) | 5359 (89.7%) | 6008 (89.9%) | |
| Yes | 59 (8.3%) | 613 (10.3%) | 672 (10.1%) | |
| stroke | | | | 0.163 |
| No | 672 (94.9%) | 5734 (96.0%) | 6406 (95.9%) | |
| Yes | 36 (5.1%) | 238 (4.0%) | 274 (4.1%) | |
| Systolic blood pressure | | | | 0.090 |
| Median | 123.000 | 124.000 | 124.000 | |
| Q1, Q3 | 113.000, 135.000 | 113.000, 138.000 | 113.000, 138.000 | |
| Range | 82.000 - 228.000 | 73.000 - 270.000 | 73.000 - 270.000 | |
| Total cholesterol | | | | 0.394 |
| Median | 199.000 | 201.000 | 201.000 | |
| Q1, Q3 | 172.000, 229.000 | 175.000, 228.000 | 175.000, 229.000 | |
| Range | 99.000 - 704.000 | 82.000 - 650.000 | 82.000 - 704.000 | |
| HDL cholesterol | | | | 0.260 |
| Median | 50.000 | 52.000 | 52.000 | |
| Q1, Q3 | 42.000, 63.000 | 43.000, 64.000 | 42.000, 63.000 | |
| Range | 17.000 - 164.000 | 17.000 - 188.000 | 17.000 - 188.000 | |
| difficulties with mobility | | | | 0.817 |
| No Difficulty | 542 (76.6%) | 4595 (76.9%) | 5137 (76.9%) | |
| Any Difficulty | 166 (23.4%) | 1377 (23.1%) | 1543 (23.1%) | |
| total log activity count (log(1+activity)) | | | | |
| Median | NA | 2910.926 | 2910.926 | |
| Q1, Q3 | NA | 2384.757, 3430.648 | 2384.757, 3430.648 | |
| Range | NA | 313.083 - 6122.678 | 313.083 - 6122.678 | |
| total accelerometer wear time | | | | |
| Median | NA | 852.071 | 852.071 | |
| Q1, Q3 | NA | 782.851, 922.036 | 782.851, 922.036 | |
| Range | NA | 600.000 - 1440.000 | 600.000 - 1440.000 | |
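It appears that the 708 participants without physical activity data are younger than the 5972 participants with physical activity data (median age 48.4 vs 53.8 years, p < 0.001).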
Missingness patterns over variables
(In)complete cases
This section presents participants with at least one missing value. First we list participants with at least one missing value in a filterable table.
Then we report the patterns of missing values for this set of participants.
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] VIM_6.1.0 colorspace_2.0-0 arsenal_3.6.1 DT_0.17
## [5] kableExtra_1.3.1 gt_0.2.2 naniar_0.6.0 Hmisc_4.4-2
## [9] Formula_1.2-4 survival_3.1-12 lattice_0.20-41 forcats_0.5.1
## [13] stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4 readr_1.4.0
## [17] tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3 tidyverse_1.3.0
## [21] here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] ellipsis_0.3.1 class_7.3-17 rio_0.5.16
## [4] visdat_0.5.3 rprojroot_2.0.2 htmlTable_2.1.0
## [7] base64enc_0.1-3 fs_1.5.0 rstudioapi_0.13
## [10] farver_2.0.3 lubridate_1.7.9.2 ranger_0.12.1
## [13] xml2_1.3.2 splines_4.0.2 robustbase_0.93-7
## [16] knitr_1.31 jsonlite_1.7.2 broom_0.7.4
## [19] cluster_2.1.0 dbplyr_2.1.0 png_0.1-7
## [22] compiler_4.0.2 httr_1.4.2 backports_1.2.1
## [25] assertthat_0.2.1 Matrix_1.2-18 cli_2.3.0
## [28] htmltools_0.5.1.1 tools_4.0.2 gtable_0.3.0
## [31] glue_1.4.2 Rcpp_1.0.6 carData_3.0-4
## [34] cellranger_1.1.0 vctrs_0.3.6 crosstalk_1.1.1
## [37] lmtest_0.9-38 xfun_0.20 laeken_0.5.1
## [40] openxlsx_4.2.3 rvest_0.3.6 lifecycle_0.2.0
## [43] DEoptimR_1.0-8 MASS_7.3-51.6 zoo_1.8-8
## [46] scales_1.1.1 hms_1.0.0 RColorBrewer_1.1-2
## [49] yaml_2.2.1 curl_4.3 gridExtra_2.3
## [52] UpSetR_1.4.0 sass_0.3.1 rpart_4.1-15
## [55] latticeExtra_0.6-29 stringi_1.5.3 highr_0.8
## [58] e1071_1.7-4 checkmate_2.0.0 boot_1.3-25
## [61] zip_2.1.1 rlang_0.4.10 pkgconfig_2.0.3
## [64] commonmark_1.7 evaluate_0.14 htmlwidgets_1.5.3
## [67] labeling_0.4.2 tidyselect_1.1.0 plyr_1.8.6
## [70] magrittr_2.0.1 bookdown_0.21 R6_2.5.0
## [73] generics_0.1.0 DBI_1.1.1 pillar_1.4.7
## [76] haven_2.3.1 foreign_0.8-80 withr_2.4.1
## [79] abind_1.4-5 sp_1.4-5 nnet_7.3-14
## [82] modelr_0.1.8 crayon_1.4.1 car_3.0-10
## [85] rmarkdown_2.6 jpeg_0.1-8.1 readxl_1.3.1
## [88] data.table_1.13.6 rmdformats_1.0.1 vcd_1.4-8
## [91] reprex_1.0.0 digest_0.6.27 webshot_0.5.2
## [94] munsell_0.5.0 viridisLite_0.3.0
Univariate distribution checks
This section reports a series of univariate summary checks of the NHANES dataset.
## Rows: 6,680
## Columns: 33
## $ seqn <labelled> 21009, 21010, 21012, 21015, 21017, 21018, 2101...
## $ age <labelled> 56.00000, 52.83333, 63.83333, 83.91667, 37.083...
## $ gender <fct> Male, Female, Male, Male, Female, Female, Female, F...
## $ permth.exm <labelled> 135, 149, 127, 24, 151, 154, 153, 154, 141, 14...
## $ mortstat <labelled> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ educationadult <fct> High school, More than high school, High school, Mo...
## $ smokecigs <fct> Never, Current, Current, Former, Current, Never, Fo...
## $ drinkstatus <fct> Non-Drinker, Heavy Drinker, Missing alcohol, Non-Dr...
## $ bmi <labelled> 31.26, 25.49, 19.60, 28.32, 19.34, 16.57, 38.0...
## $ diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes...
## $ chf <fct> No, No, No, No, No, No, Yes, No, No, No, No, No, No...
## $ cancer <fct> No, No, No, Yes, No, No, No, No, No, No, Yes, No, N...
## $ stroke <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ sys <labelled> 120, 133, 123, 154, 103, 137, 115, 131, 121, 1...
## $ lbxtc <labelled> 254, 174, 191, 141, 184, NA, 173, 230, 261, 21...
## $ lbdhdd <labelled> 37, 119, 92, 34, 77, NA, 45, 51, 29, 68, 53, 4...
## $ mobilityproblem <fct> No Difficulty, No Difficulty, Any Difficulty, Any D...
## $ tac <labelled> 409352.71, 286407.71, 130778.29, 102562.86, 41...
## $ tlac <labelled> 3522.427, 3334.503, 2749.086, 2103.580, 3689.4...
## $ mvpa <labelled> 48.285714, 9.428571, 4.714286, 3.000000, 58.83...
## $ wt <labelled> 900.2857, 783.4286, 1053.0000, 813.1429, 833.8...
## $ tlac.1 <labelled> 0.0000000, 0.0000000, 161.7450224, 5.6786074, ...
## $ tlac.2 <labelled> 0.000000, 0.000000, 128.091725, 7.244960, 0.00...
## $ tlac.3 <labelled> 66.563485, 0.000000, 145.091848, 8.942295, 0.0...
## $ tlac.4 <labelled> 476.33325, 0.00000, 74.34726, 18.38650, 459.83...
## $ tlac.5 <labelled> 612.21257, 358.95003, 152.73994, 106.23999, 57...
## $ tlac.6 <labelled> 586.1977, 449.0983, 249.8184, 286.1406, 571.73...
## $ tlac.7 <labelled> 462.8831, 514.9402, 352.7644, 393.9136, 637.96...
## $ tlac.8 <labelled> 587.5167, 550.7981, 277.3621, 321.9635, 634.96...
## $ tlac.9 <labelled> 315.9585, 487.6278, 302.4908, 327.1102, 254.62...
## $ tlac.10 <labelled> 251.4170, 527.7868, 310.0226, 313.5571, 338.45...
## $ tlac.11 <labelled> 159.78640, 401.07109, 312.24852, 254.92505, 19...
## $ tlac.12 <labelled> 3.558282, 44.230766, 282.363630, 59.477670, 26...
Data set overview
Using the Hmisc describe function, we provide an overview of the data set. The descriptive report also provides histograms of continuous variables. For ease of scanning the information, we group the report by measurement type.
Demographic and lifestyle variables
6 Variables 5972 Observations
age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 660 | 1 | 54.87 | 17.64 | 32.17 | 34.58 | 41.65 | 53.75 | 67.25 | 76.50 | 80.83 |
gender
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | Male | Female |
|---|---|---|
| Frequency | 2935 | 3037 |
| Proportion | 0.491 | 0.509 |
educationadult: education level
| n | missing | distinct |
|---|---|---|
| 5968 | 4 | 3 |
| Value | Less than high school | High school | More than high school |
|---|---|---|---|
| Frequency | 1683 | 1448 | 2837 |
| Proportion | 0.282 | 0.243 | 0.475 |
bmi: body mass index kg/m2
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5928 | 44 | 2161 | 1 | 29.07 | 6.804 | 20.78 | 22.12 | 24.73 | 28.08 | 32.23 | 37.07 | 40.80 |
smokecigs: smoking status
| n | missing | distinct |
|---|---|---|
| 5970 | 2 | 3 |
| Value | Never | Former | Current |
|---|---|---|---|
| Frequency | 2911 | 1759 | 1300 |
| Proportion | 0.488 | 0.295 | 0.218 |
drinkstatus: alcohol consumption
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 4 |
| Value | Moderate Drinker | Non-Drinker | Heavy Drinker | Missing alcohol |
|---|---|---|---|---|
| Frequency | 3090 | 2098 | 389 | 395 |
| Proportion | 0.517 | 0.351 | 0.065 | 0.066 |
Physiological measurements
3 Variables 5972 Observations
sys: Systolic blood pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5698 | 274 | 137 | 1 | 127.4 | 22.26 | 100.0 | 105.0 | 113.0 | 124.0 | 138.0 | 154.0 | 166.1 |
lbxtc: Total cholesterol mg/dL
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5742 | 230 | 264 | 1 | 204.1 | 46.4 | 143 | 155 | 175 | 201 | 228 | 258 | 277 |
lbdhdd: HDL cholesterol mg/dL
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5742 | 230 | 109 | 1 | 54.64 | 17.91 | 33 | 37 | 43 | 52 | 64 | 76 | 85 |
Comorbidities
5 Variables 5972 Observations
mortstat: Final mortality status
| n | missing | distinct | Info | Sum | Mean | Gmd |
|---|---|---|---|---|---|---|
| 5964 | 8 | 2 | 0.441 | 1068 | 0.1791 | 0.2941 |
diabetes
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | No | Yes |
|---|---|---|
| Frequency | 5214 | 758 |
| Proportion | 0.873 | 0.127 |
chf: congestive heart failure
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | No | Yes |
|---|---|---|
| Frequency | 5739 | 233 |
| Proportion | 0.961 | 0.039 |
cancer
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | No | Yes |
|---|---|---|
| Frequency | 5359 | 613 |
| Proportion | 0.897 | 0.103 |
stroke
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | No | Yes |
|---|---|---|
| Frequency | 5734 | 238 |
| Proportion | 0.96 | 0.04 |
Physical activity variables
17 Variables 5972 Observations
mobilityproblem: difficulties with mobility
| n | missing | distinct |
|---|---|---|
| 5972 | 0 | 2 |
| Value | No Difficulty | Any Difficulty |
|---|---|---|
| Frequency | 4595 | 1377 |
| Proportion | 0.769 | 0.231 |
tac: total activity counts per day
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5965 | 1 | 244811 | 143738 | 69233 | 94872 | 150571 | 223572 | 314224 | 417410 | 486450 |
| lowest : | 8263.000 | 8931.833 | 12123.000 | 14642.000 | 15656.000 |
| highest: | 981517.167 | 986261.000 | 986593.800 | 1097823.500 | 1122542.600 |
tlac: total log activity count (log(1+activity))
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5969 | 1 | 2900 | 873.5 | 1613 | 1900 | 2385 | 2911 | 3431 | 3877 | 4164 |
| lowest : | 313.0835 | 364.4561 | 400.8157 | 429.9288 | 466.0362 |
| highest: | 5436.1548 | 5492.5395 | 5588.3401 | 5655.4680 | 6122.6779 |
mvpa: Moderate or vigorous physical activity minutes
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 1163 | 1 | 19.19 | 20.9 | 0.800 | 1.429 | 4.000 | 12.000 | 26.762 | 46.000 | 59.921 |
| lowest : | 0.0000000 | 0.1428571 | 0.1666667 | 0.2000000 | 0.2500000 |
| highest: | 180.8333333 | 186.2000000 | 194.8000000 | 208.5000000 | 249.0000000 |
wt: total accelerometer wear time minutes
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 3613 | 1 | 866.1 | 139.8 | 684.3 | 721.0 | 782.9 | 852.1 | 922.0 | 1000.6 | 1111.5 |
tlac.1: total log activity count 12:00AM-2:00AM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 2656 | 0.829 | 30.92 | 51.83 | 0.00 | 0.00 | 0.00 | 0.00 | 24.38 | 94.43 | 169.25 |
| lowest : | 0.0000000 | 0.1569446 | 0.1831020 | 0.2299197 | 0.2559656 |
| highest: | 597.3808309 | 620.0469233 | 674.1677375 | 709.3300116 | 719.0239316 |
tlac.2: total log activity count 2:00AM-4:00AM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 1770 | 0.653 | 19.09 | 34.47 | 0.00 | 0.00 | 0.00 | 0.00 | 2.91 | 51.83 | 110.64 |
| lowest : | 0.00000000 | 0.09902103 | 0.11552453 | 0.15694461 | 0.23104906 |
| highest: | 586.34967162 | 611.00545824 | 617.44773130 | 737.25383394 | 775.42871350 |
tlac.3: total log activity count 4:00AM-6:00AM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 2834 | 0.855 | 43.29 | 70.78 | 0.00 | 0.00 | 0.00 | 0.00 | 38.74 | 147.59 | 248.43 |
| lowest : | 0.0000000 | 0.1155245 | 0.1386294 | 0.2299197 | 0.2682397 |
| highest: | 679.1484297 | 697.1093552 | 704.5766819 | 719.3198459 | 769.6014301 |
tlac.4: total log activity count 6:00AM-8:00AM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5285 | 0.998 | 177 | 178.6 | 0.00 | 0.00 | 36.94 | 137.34 | 282.09 | 416.35 | 496.25 |
| lowest : | 0.0000000 | 0.2299197 | 0.3465736 | 0.6148132 | 0.6839274 |
| highest: | 774.8811640 | 792.6938042 | 822.1482092 | 832.9933042 | 857.9018816 |
tlac.5: total log activity count 8:00AM-10:00AM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5834 | 1 | 339.3 | 191.7 | 39.52 | 102.56 | 221.28 | 346.74 | 460.18 | 552.17 | 610.19 |
| lowest : | 0.0000000 | 0.2310491 | 0.7250248 | 0.8652549 | 1.0357837 |
| highest: | 812.0225306 | 812.8675420 | 813.2942210 | 824.5800445 | 888.1759271 |
tlac.6: total log activity count 10:00AM-12:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5931 | 1 | 407.7 | 163.6 | 150.4 | 218.6 | 316.2 | 415.0 | 506.7 | 589.9 | 634.9 |
| lowest : | 0.0000000 | 0.6986213 | 2.6001909 | 4.5903937 | 5.7234361 |
| highest: | 807.7712473 | 808.7247458 | 811.5701740 | 884.1169241 | 892.0314653 |
tlac.7: total log activity count 12:00PM-2:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5947 | 1 | 418 | 146.9 | 192.1 | 250.4 | 337.6 | 423.5 | 507.2 | 581.3 | 623.7 |
| lowest : | 0.000000 | 1.734669 | 2.704424 | 5.605670 | 6.387910 |
| highest: | 788.370472 | 796.082067 | 813.380498 | 821.733575 | 885.445891 |
tlac.8: total log activity count 2:00PM-4:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5954 | 1 | 411.7 | 147.8 | 192.1 | 243.1 | 323.6 | 414.3 | 501.7 | 577.5 | 619.9 |
| lowest : | 0.000000 | 1.974752 | 3.096473 | 4.094345 | 5.772020 |
| highest: | 792.683985 | 837.042353 | 846.553847 | 877.212734 | 904.872351 |
tlac.9: total log activity count 4:00PM-6:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5955 | 1 | 397 | 140.3 | 185.4 | 234.8 | 316.4 | 401.8 | 483.5 | 553.6 | 591.4 |
| lowest : | 0.000000 | 2.957040 | 3.401197 | 4.148165 | 5.084134 |
| highest: | 771.497952 | 783.128869 | 801.039991 | 809.429425 | 822.294800 |
tlac.10: total log activity count 6:00PM-8:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5932 | 1 | 337.6 | 151.3 | 114.1 | 165.5 | 246.6 | 339.5 | 433.1 | 504.4 | 548.9 |
| lowest : | 0.000000 | 1.311822 | 1.353699 | 1.753975 | 3.459493 |
| highest: | 778.168243 | 778.774433 | 802.020060 | 851.421446 | 860.123328 |
tlac.11: total log activity count 8:00PM-10:00PM
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 5786 | 1 | 223.2 | 158.2 | 10.22 | 42.32 | 116.77 | 212.72 | 315.90 | 411.75 | 471.84 |
| lowest : | 0.0000000 | 0.6229449 | 0.6708919 | 1.0233141 | 1.0525597 |
| highest: | 724.9040071 | 753.8848070 | 821.4989318 | 826.3463412 | 839.8942777 |
tlac.12: total log activity count 10:00PM-12:00AM
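| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5972 | 0 | 4943 | 0.995 | 95.37 | 114.3 | 0.000 | 0.000 | 6.693 | 55.438 | 141.863 | 251.308 | 328.945 |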
| lowest : | 0.00000000 | 0.09902103 | 0.17328680 | 0.27798716 | 0.41291025 |
| highest: | 683.58618305 | 698.46723961 | 702.66304648 | 707.15487443 | 733.61717206 |
Categorical variables
We now provide a closer visual examination of the categorical predictors.
Continuous variables
A closer visual examination of continuous predictors and the outcome variable.
There is evidence of potentially influential (extreme) values in some of the distributions. This is explored further with targeted summaries. More detailed univariate summaries for the variables of interest are also provided below.
Age
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_bar).
Distribution of age
Blood pressure
Distribution of SBP
Body mass index
Distribution of BMI
There is a participant with an unusually high value (130.2). It is possible that this is an entry error (bmi=30.2).
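Unusual values of this kind can be flagged programmatically before deciding how to handle them. In the sketch below the BMI values are illustrative and the plausibility limits are assumptions that would need domain input.

```r
bmi <- c(31.26, 25.49, 19.60, 28.32, 130.21, 16.57, NA)  # illustrative values

# Assumed plausibility limits; appropriate limits need domain input.
lower <- 10
upper <- 100

flag <- !is.na(bmi) & (bmi < lower | bmi > upper)
data.frame(row = which(flag), bmi = bmi[flag])
```

Flagged records are candidates for review against the source data rather than for automatic exclusion.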
Total cholesterol
Distribution of total cholesterol
Distribution of HDL
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4
## [9] readr_1.4.0 tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3
## [13] tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 jsonlite_1.7.2 splines_4.0.2
## [4] modelr_0.1.8 assertthat_0.2.1 highr_0.8
## [7] latticeExtra_0.6-29 cellranger_1.1.0 yaml_2.2.1
## [10] pillar_1.4.7 backports_1.2.1 glue_1.4.2
## [13] digest_0.6.27 RColorBrewer_1.1-2 checkmate_2.0.0
## [16] rvest_0.3.6 colorspace_2.0-0 htmltools_0.5.1.1
## [19] Matrix_1.2-18 pkgconfig_2.0.3 broom_0.7.4
## [22] haven_2.3.1 bookdown_0.21 patchwork_1.1.1
## [25] scales_1.1.1 jpeg_0.1-8.1 htmlTable_2.1.0
## [28] generics_0.1.0 farver_2.0.3 ellipsis_0.3.1
## [31] withr_2.4.1 nnet_7.3-14 cli_2.3.0
## [34] magrittr_2.0.1 crayon_1.4.1 readxl_1.3.1
## [37] evaluate_0.14 fs_1.5.0 fansi_0.4.2
## [40] xml2_1.3.2 foreign_0.8-80 tools_4.0.2
## [43] data.table_1.13.6 hms_1.0.0 lifecycle_0.2.0
## [46] munsell_0.5.0 reprex_1.0.0 cluster_2.1.0
## [49] compiler_4.0.2 rlang_0.4.10 grid_4.0.2
## [52] rstudioapi_0.13 htmlwidgets_1.5.3 base64enc_0.1-3
## [55] labeling_0.4.2 rmarkdown_2.6 gtable_0.3.0
## [58] DBI_1.1.1 R6_2.5.0 gridExtra_2.3
## [61] lubridate_1.7.9.2 knitr_1.31 utf8_1.1.4
## [64] rprojroot_2.0.2 stringi_1.5.3 rmdformats_1.0.1
## [67] Rcpp_1.0.6 vctrs_0.3.6 rpart_4.1-15
## [70] png_0.1-7 dbplyr_2.1.0 tidyselect_1.1.0
## [73] xfun_0.20
Multivariate distributions
This section reports a series of multivariate summaries of the NHANES dataset.
Overview
Variable correlation
Correlations of the physical activity variables (outcome)
Variable clustering
Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
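The principle can be illustrated in base R by clustering variables on a squared Spearman correlation similarity. This is a rough sketch on simulated data; the choice of complete linkage is an assumption and the code is not the procedure used to produce the report's output.

```r
set.seed(8)
n   <- 300
dat <- data.frame(age = runif(n, 30, 85), bmi = rnorm(n, 29, 6))
dat$sys    <- 90 + 0.6 * dat$age + rnorm(n, sd = 15)
dat$lbdhdd <- 80 - 0.9 * dat$bmi + rnorm(n, sd = 14)

# Similarity = squared Spearman correlation; dissimilarity = 1 - similarity.
sim <- cor(dat, method = "spearman")^2
hc  <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)

# The dendrogram is drawn only in an interactive session.
if (interactive()) plot(hc, main = "Variable clustering (toy data)")
```

The report itself uses Hmisc::varclus on the NHANES variables.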
## Hmisc::varclus(x = ~age + gender + bmi + sys + lbxtc + lbdhdd +
## smokecigs + drinkstatus + mortstat + diabetes + chf + cancer +
## stroke, data = a_nhanes)
##
##
## Similarity matrix (Spearman rho^2)
##
## age genderFemale bmi sys lbxtc lbdhdd
## age 1.00 0.00 0.00 0.19 0.00 0.00
## genderFemale 0.00 1.00 0.00 0.00 0.01 0.13
## bmi 0.00 0.00 1.00 0.02 0.00 0.08
## sys 0.19 0.00 0.02 1.00 0.00 0.00
## lbxtc 0.00 0.01 0.00 0.00 1.00 0.02
## lbdhdd 0.00 0.13 0.08 0.00 0.02 1.00
## smokecigsFormer 0.07 0.02 0.00 0.01 0.00 0.00
## smokecigsCurrent 0.02 0.01 0.01 0.00 0.00 0.01
## drinkstatusNon-Drinker 0.05 0.01 0.01 0.01 0.00 0.01
## drinkstatusHeavy Drinker 0.00 0.01 0.01 0.00 0.00 0.01
## drinkstatusMissing alcohol 0.01 0.00 0.00 0.00 0.00 0.00
## mortstat 0.17 0.01 0.00 0.04 0.01 0.00
## diabetesYes 0.04 0.00 0.02 0.01 0.01 0.01
## chfYes 0.03 0.00 0.00 0.00 0.01 0.00
## cancerYes 0.06 0.00 0.00 0.00 0.00 0.00
## strokeYes 0.03 0.00 0.00 0.01 0.00 0.00
## smokecigsFormer smokecigsCurrent
## age 0.07 0.02
## genderFemale 0.02 0.01
## bmi 0.00 0.01
## sys 0.01 0.00
## lbxtc 0.00 0.00
## lbdhdd 0.00 0.01
## smokecigsFormer 1.00 0.12
## smokecigsCurrent 0.12 1.00
## drinkstatusNon-Drinker 0.00 0.02
## drinkstatusHeavy Drinker 0.00 0.03
## drinkstatusMissing alcohol 0.00 0.00
## mortstat 0.02 0.00
## diabetesYes 0.00 0.00
## chfYes 0.01 0.00
## cancerYes 0.01 0.00
## strokeYes 0.00 0.00
## drinkstatusNon-Drinker drinkstatusHeavy Drinker
## age 0.05 0.00
## genderFemale 0.01 0.01
## bmi 0.01 0.01
## sys 0.01 0.00
## lbxtc 0.00 0.00
## lbdhdd 0.01 0.01
## smokecigsFormer 0.00 0.00
## smokecigsCurrent 0.02 0.03
## drinkstatusNon-Drinker 1.00 0.04
## drinkstatusHeavy Drinker 0.04 1.00
## drinkstatusMissing alcohol 0.04 0.01
## mortstat 0.02 0.00
## diabetesYes 0.02 0.00
## chfYes 0.01 0.00
## cancerYes 0.00 0.00
## strokeYes 0.01 0.00
## drinkstatusMissing alcohol mortstat diabetesYes
## age 0.01 0.17 0.04
## genderFemale 0.00 0.01 0.00
## bmi 0.00 0.00 0.02
## sys 0.00 0.04 0.01
## lbxtc 0.00 0.01 0.01
## lbdhdd 0.00 0.00 0.01
## smokecigsFormer 0.00 0.02 0.00
## smokecigsCurrent 0.00 0.00 0.00
## drinkstatusNon-Drinker 0.04 0.02 0.02
## drinkstatusHeavy Drinker 0.01 0.00 0.00
## drinkstatusMissing alcohol 1.00 0.00 0.00
## mortstat 0.00 1.00 0.03
## diabetesYes 0.00 0.03 1.00
## chfYes 0.00 0.04 0.03
## cancerYes 0.00 0.03 0.00
## strokeYes 0.00 0.03 0.02
## chfYes cancerYes strokeYes
## age 0.03 0.06 0.03
## genderFemale 0.00 0.00 0.00
## bmi 0.00 0.00 0.00
## sys 0.00 0.00 0.01
## lbxtc 0.01 0.00 0.00
## lbdhdd 0.00 0.00 0.00
## smokecigsFormer 0.01 0.01 0.00
## smokecigsCurrent 0.00 0.00 0.00
## drinkstatusNon-Drinker 0.01 0.00 0.01
## drinkstatusHeavy Drinker 0.00 0.00 0.00
## drinkstatusMissing alcohol 0.00 0.00 0.00
## mortstat 0.04 0.03 0.03
## diabetesYes 0.03 0.00 0.02
## chfYes 1.00 0.00 0.02
## cancerYes 0.00 1.00 0.00
## strokeYes 0.02 0.00 1.00
##
## No. of observations used for each pair:
##
## age genderFemale bmi sys lbxtc lbdhdd
## age 6680 6680 6624 6360 6410 6410
## genderFemale 6680 6680 6624 6360 6410 6410
## bmi 6624 6624 6624 6316 6358 6358
## sys 6360 6360 6316 6360 6133 6133
## lbxtc 6410 6410 6358 6133 6410 6410
## lbdhdd 6410 6410 6358 6133 6410 6410
## smokecigsFormer 6676 6676 6621 6356 6406 6406
## smokecigsCurrent 6676 6676 6621 6356 6406 6406
## drinkstatusNon-Drinker 6680 6680 6624 6360 6410 6410
## drinkstatusHeavy Drinker 6680 6680 6624 6360 6410 6410
## drinkstatusMissing alcohol 6680 6680 6624 6360 6410 6410
## mortstat 6671 6671 6615 6351 6401 6401
## diabetesYes 6680 6680 6624 6360 6410 6410
## chfYes 6680 6680 6624 6360 6410 6410
## cancerYes 6680 6680 6624 6360 6410 6410
## strokeYes 6680 6680 6624 6360 6410 6410
## smokecigsFormer smokecigsCurrent
## age 6676 6676
## genderFemale 6676 6676
## bmi 6621 6621
## sys 6356 6356
## lbxtc 6406 6406
## lbdhdd 6406 6406
## smokecigsFormer 6676 6676
## smokecigsCurrent 6676 6676
## drinkstatusNon-Drinker 6676 6676
## drinkstatusHeavy Drinker 6676 6676
## drinkstatusMissing alcohol 6676 6676
## mortstat 6667 6667
## diabetesYes 6676 6676
## chfYes 6676 6676
## cancerYes 6676 6676
## strokeYes 6676 6676
## drinkstatusNon-Drinker drinkstatusHeavy Drinker
## age 6680 6680
## genderFemale 6680 6680
## bmi 6624 6624
## sys 6360 6360
## lbxtc 6410 6410
## lbdhdd 6410 6410
## smokecigsFormer 6676 6676
## smokecigsCurrent 6676 6676
## drinkstatusNon-Drinker 6680 6680
## drinkstatusHeavy Drinker 6680 6680
## drinkstatusMissing alcohol 6680 6680
## mortstat 6671 6671
## diabetesYes 6680 6680
## chfYes 6680 6680
## cancerYes 6680 6680
## strokeYes 6680 6680
## drinkstatusMissing alcohol mortstat diabetesYes
## age 6680 6671 6680
## genderFemale 6680 6671 6680
## bmi 6624 6615 6624
## sys 6360 6351 6360
## lbxtc 6410 6401 6410
## lbdhdd 6410 6401 6410
## smokecigsFormer 6676 6667 6676
## smokecigsCurrent 6676 6667 6676
## drinkstatusNon-Drinker 6680 6671 6680
## drinkstatusHeavy Drinker 6680 6671 6680
## drinkstatusMissing alcohol 6680 6671 6680
## mortstat 6671 6671 6671
## diabetesYes 6680 6671 6680
## chfYes 6680 6671 6680
## cancerYes 6680 6671 6680
## strokeYes 6680 6671 6680
## chfYes cancerYes strokeYes
## age 6680 6680 6680
## genderFemale 6680 6680 6680
## bmi 6624 6624 6624
## sys 6360 6360 6360
## lbxtc 6410 6410 6410
## lbdhdd 6410 6410 6410
## smokecigsFormer 6676 6676 6676
## smokecigsCurrent 6676 6676 6676
## drinkstatusNon-Drinker 6680 6680 6680
## drinkstatusHeavy Drinker 6680 6680 6680
## drinkstatusMissing alcohol 6680 6680 6680
## mortstat 6671 6671 6671
## diabetesYes 6680 6680 6680
## chfYes 6680 6680 6680
## cancerYes 6680 6680 6680
## strokeYes 6680 6680 6680
##
## hclust results (method=complete)
##
##
## Call:
## hclust(d = as.dist(1 - x), method = method)
##
## Cluster method : complete
## Number of objects: 16
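The output above reports a similarity matrix of squared Spearman correlations computed on pairwise-available observations, followed by complete-linkage hierarchical clustering on `1 - similarity`. A minimal base R sketch of that computation is given below. It assumes simulated numeric data (the data frame `toy` and its values are hypothetical stand-ins, not the NHANES data) and does not reproduce the dummy coding that `Hmisc::varclus` applies to categorical variables.

```r
# Simulated stand-in data (hypothetical values, numeric columns only)
set.seed(1)
n <- 200
age <- runif(n, 30, 85)
toy <- data.frame(
  age   = age,
  sys   = 100 + 0.5 * age + rnorm(n, sd = 10),  # constructed to correlate with age
  bmi   = rnorm(n, mean = 29, sd = 6),
  lbxtc = rnorm(n, mean = 200, sd = 40)
)
toy$sys[1:10] <- NA  # a few missing values, handled pairwise below

# Similarity: squared Spearman correlation on pairwise-complete observations
sim <- cor(toy, method = "spearman", use = "pairwise.complete.obs")^2

# Complete-linkage hierarchical clustering on the dissimilarity 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)
```

Calling `plot(hc)` on the result draws the dendrogram; variables that join at a low height are the most similar.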
Plot associations.
Variable redundancy
Redundancy analysis of predictor variables.
##
## Redundancy Analysis
##
## Hmisc::redun(formula = ~age + gender + bmi + sys + lbxtc + lbdhdd +
## smokecigs + drinkstatus + mortstat + diabetes + chf + cancer +
## stroke, data = a_nhanes)
##
## n: 6080 p: 13 nk: 3
##
## Number of NAs: 600
## Frequencies of Missing Values Due to Each Variable
## age gender bmi sys lbxtc lbdhdd
## 0 0 56 320 270 270
## smokecigs drinkstatus mortstat diabetes chf cancer
## 4 0 9 0 0 0
## stroke
## 0
##
##
## Transformation of target variables forced to be linear
##
## R-squared cutoff: 0.9 Type: ordinary
##
## R^2 with which each variable can be predicted from all other variables:
##
## age gender bmi sys lbxtc lbdhdd
## 0.417 0.222 0.156 0.207 0.057 0.274
## smokecigs drinkstatus mortstat diabetes chf cancer
## 0.116 0.142 0.282 0.110 0.091 0.080
## stroke
## 0.062
##
## No redundant variables
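The R^2 values above measure how well each variable can be predicted from all the others; a variable is flagged as redundant when its R^2 exceeds the cutoff (0.9 here). The idea can be sketched in base R as shown below. This is a simplification under stated assumptions: it uses linear terms only and a single pass, whereas `Hmisc::redun` expands predictors flexibly (the output reports `nk: 3`) and removes variables stepwise. The data frame `toy` and the helper `r2_from_others` are hypothetical names introduced for illustration.

```r
# Simulated data in which x3 is almost fully determined by x1 and x2
set.seed(2)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x2,
  x3 = x1 + x2 + rnorm(n, sd = 0.1),
  x4 = rnorm(n)  # unrelated to the other variables
)

# R^2 from regressing each variable on all the others (linear terms only)
r2_from_others <- function(data) {
  sapply(names(data), function(v) {
    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
    summary(fit)$r.squared
  })
}

r2 <- r2_from_others(toy)
round(r2, 3)

# Variables predictable above the cutoff are candidates for removal
cutoff <- 0.9
names(r2)[r2 > cutoff]
```

In a stepwise procedure only the most predictable variable would be dropped before the R^2 values are recomputed, so a first-pass list like this one generally overstates how many variables are removed.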
Summary reports by age and gender
Distribution of age by gender
Summary report by age group and gender
Summary report by gender
Baseline characteristics by gender.

| | N | Male N=3294 | Female N=3386 |
|---|---|---|---|
| age years | 6680 | 41.8 53.8 68.0 55.1 ± 15.3 | 40.8 52.4 66.2 54.0 ± 15.4 |
| body mass index kg/m² | 6624 | 24.99 27.94 31.26 28.58 ± 5.64 | 24.40 28.31 33.37 29.55 ± 7.10 |
| education level : Less than high school | 6673 | 0.30 974/3289 | 0.27 925/3384 |
| High school | | 0.24 798/3289 | 0.25 836/3384 |
| More than high school | | 0.46 1517/3289 | 0.48 1623/3384 |
| Systolic blood pressure mmHg | 6360 | 115.0 125.0 137.0 127.7 ± 18.2 | 111.0 123.0 139.0 126.8 ± 22.4 |
| Total cholesterol mg/dL | 6410 | 172.0 198.0 225.0 200.6 ± 42.8 | 178.0 204.0 231.0 207.1 ± 42.6 |
| HDL cholesterol mg/dL | 6410 | 40.0 46.0 56.0 49.0 ± 13.9 | 48.0 58.0 70.0 60.1 ± 17.2 |
| smoking status : Never | 6676 | 0.38 1266/3292 | 0.59 1987/3384 |
| Former | | 0.35 1167/3292 | 0.23 777/3384 |
| Current | | 0.26 859/3292 | 0.18 620/3384 |
| alcohol consumption : Moderate Drinker | 6680 | 0.56 1846/3294 | 0.47 1603/3386 |
| Non-Drinker | | 0.29 964/3294 | 0.41 1372/3386 |
| Heavy Drinker | | 0.08 276/3294 | 0.05 153/3386 |
| Missing alcohol | | 0.06 208/3294 | 0.08 258/3386 |
| Final mortality status | 6671 | 0.21 692/3291 | 0.14 489/3380 |
| diabetes : Yes | 6680 | 0.13 425/3294 | 0.13 427/3386 |
| congestive heart failure : Yes | 6680 | 0.05 160/3294 | 0.03 104/3386 |
| cancer : Yes | 6680 | 0.09 306/3294 | 0.11 366/3386 |
| stroke : Yes | 6680 | 0.04 136/3294 | 0.04 138/3386 |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents the mean ± 1 SD. N is the number of non-missing values.
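The continuous-variable entries in these tables (lower quartile, median, upper quartile, then mean ± SD, with N counting non-missing values) can be computed per group along the following lines. This is a sketch in base R on simulated data; `toy` and `summarise_continuous` are hypothetical names, and `quantile()` is used with its default settings, which may differ from the quantile definition used to produce the report.

```r
# Simulated stand-in data (hypothetical values) with two missing entries
set.seed(3)
toy <- data.frame(
  gender = rep(c("Male", "Female"), each = 50),
  bmi    = c(rnorm(50, mean = 28, sd = 6), rnorm(50, mean = 30, sd = 7))
)
toy$bmi[c(5, 60)] <- NA

# N (non-missing), quartiles, mean and SD for one numeric vector
summarise_continuous <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  c(N = sum(!is.na(x)),
    Q1 = q[[1]], Median = q[[2]], Q3 = q[[3]],
    Mean = mean(x, na.rm = TRUE), SD = sd(x, na.rm = TRUE))
}

# One row per gender, one column per statistic
by_gender <- t(sapply(split(toy$bmi, toy$gender), summarise_continuous))
round(by_gender, 2)
```

Reporting N alongside the statistics makes differences in missingness between groups visible, which is why the tables show it for every variable.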
Summary report by age group for men
Baseline characteristics by age group for men.

| | N | 30-44 N=1048 | 45-59 N=900 | 60-74 N=931 | 75+ N=415 |
|---|---|---|---|---|---|
| body mass index kg/m² | 3272 | 24.86 27.93 31.32 28.80 ± 6.60 | 25.35 28.05 31.38 28.78 ± 5.41 | 25.41 28.23 31.81 28.78 ± 5.10 | 24.21 26.81 29.72 27.18 ± 4.35 |
| education level : Less than high school | 3289 | 0.25 259/1048 | 0.22 195/898 | 0.38 349/930 | 0.41 171/413 |
| High school | | 0.25 264/1048 | 0.26 233/898 | 0.23 212/930 | 0.22 89/413 |
| More than high school | | 0.50 525/1048 | 0.52 470/898 | 0.40 369/930 | 0.37 153/413 |
| Systolic blood pressure mmHg | 3164 | 112.0 119.0 129.0 121.0 ± 12.4 | 115.0 123.0 134.5 126.2 ± 17.1 | 120.0 131.0 145.0 133.3 ± 19.9 | 120.0 133.0 147.0 134.9 ± 21.9 |
| Total cholesterol mg/dL | 3180 | 177.0 200.0 229.0 204.0 ± 42.1 | 178.0 204.0 231.0 206.7 ± 42.8 | 168.0 193.0 222.0 197.0 ± 44.1 | 158.5 185.0 212.5 187.2 ± 37.6 |
| HDL cholesterol mg/dL | 3180 | 39.0 45.0 54.0 47.8 ± 14.1 | 40.0 47.0 57.0 49.2 ± 14.1 | 40.0 46.0 56.0 49.1 ± 13.4 | 41.0 47.0 58.0 50.7 ± 14.2 |
| smoking status : Never | 3292 | 0.49 518/1048 | 0.39 351/900 | 0.28 262/930 | 0.33 135/414 |
| Former | | 0.19 196/1048 | 0.28 253/900 | 0.51 472/930 | 0.59 246/414 |
| Current | | 0.32 334/1048 | 0.33 296/900 | 0.21 196/930 | 0.08 33/414 |
| alcohol consumption : Moderate Drinker | 3294 | 0.62 649/1048 | 0.59 533/900 | 0.51 477/931 | 0.45 187/415 |
| Non-Drinker | | 0.19 201/1048 | 0.25 226/900 | 0.37 345/931 | 0.46 192/415 |
| Heavy Drinker | | 0.10 104/1048 | 0.10 87/900 | 0.08 71/931 | 0.03 14/415 |
| Missing alcohol | | 0.09 94/1048 | 0.06 54/900 | 0.04 38/931 | 0.05 22/415 |
| diabetes : Yes | 3294 | 0.05 49/1048 | 0.10 90/900 | 0.23 218/931 | 0.16 68/415 |
| congestive heart failure : Yes | 3294 | 0.01 6/1048 | 0.03 28/900 | 0.09 80/931 | 0.11 46/415 |
| cancer : Yes | 3294 | 0.02 17/1048 | 0.04 39/900 | 0.15 137/931 | 0.27 113/415 |
| stroke : Yes | 3294 | 0.00 4/1048 | 0.02 15/900 | 0.07 69/931 | 0.12 48/415 |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents the mean ± 1 SD. N is the number of non-missing values.
Summary report by age group for women
Baseline characteristics by age group for women.

| | N | 30-44 N=1164 | 45-59 N=924 | 60-74 N=905 | 75+ N=393 |
|---|---|---|---|---|---|
| body mass index kg/m² | 3352 | 23.99 28.00 33.37 29.34 ± 7.26 | 24.66 29.18 35.06 30.45 ± 7.67 | 24.92 28.82 33.36 29.86 ± 6.74 | 23.59 26.82 30.14 27.25 ± 5.31 |
| education level : Less than high school | 3384 | 0.22 258/1163 | 0.20 186/924 | 0.36 323/905 | 0.40 158/392 |
| High school | | 0.20 237/1163 | 0.24 226/924 | 0.28 250/905 | 0.31 123/392 |
| More than high school | | 0.57 668/1163 | 0.55 512/924 | 0.37 332/905 | 0.28 111/392 |
| Systolic blood pressure mmHg | 3196 | 104.0 112.0 120.0 113.2 ± 13.4 | 113.0 123.0 135.0 125.6 ± 19.5 | 121.0 135.0 150.2 137.2 ± 22.1 | 131.0 143.0 159.0 145.9 ± 24.9 |
| Total cholesterol mg/dL | 3230 | 170.0 195.0 223.0 199.0 ± 42.5 | 182.0 208.0 232.0 209.6 ± 41.7 | 187.0 212.0 239.0 215.0 ± 42.2 | 176.0 204.0 234.0 207.1 ± 41.8 |
| HDL cholesterol mg/dL | 3230 | 47.0 57.0 69.0 59.5 ± 17.3 | 47.0 57.0 70.0 60.1 ± 17.5 | 49.0 57.5 69.0 60.1 ± 16.7 | 49.0 60.0 73.0 62.0 ± 17.2 |
| smoking status : Never | 3384 | 0.64 745/1164 | 0.54 499/923 | 0.56 504/904 | 0.61 239/393 |
| Former | | 0.14 159/1164 | 0.23 216/923 | 0.30 273/904 | 0.33 129/393 |
| Current | | 0.22 260/1164 | 0.23 208/923 | 0.14 127/904 | 0.06 25/393 |
| alcohol consumption : Moderate Drinker | 3386 | 0.57 661/1164 | 0.51 471/924 | 0.39 352/905 | 0.30 119/393 |
| Non-Drinker | | 0.29 339/1164 | 0.34 316/924 | 0.53 484/905 | 0.59 233/393 |
| Heavy Drinker | | 0.04 46/1164 | 0.07 64/924 | 0.03 31/905 | 0.03 12/393 |
| Missing alcohol | | 0.10 118/1164 | 0.08 73/924 | 0.04 38/905 | 0.07 29/393 |
| diabetes : Yes | 3386 | 0.03 40/1164 | 0.13 121/924 | 0.22 195/905 | 0.18 71/393 |
| congestive heart failure : Yes | 3386 | 0.01 8/1164 | 0.02 19/924 | 0.05 42/905 | 0.09 35/393 |
| cancer : Yes | 3386 | 0.04 46/1164 | 0.10 92/924 | 0.14 130/905 | 0.25 98/393 |
| stroke : Yes | 3386 | 0.01 15/1164 | 0.04 33/924 | 0.06 51/905 | 0.10 39/393 |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents the mean ± 1 SD. N is the number of non-missing values.
Continuous variables by age and gender
Distribution of systolic blood pressure
Distribution of cholesterol
Distribution of BMI
Distribution of wear time
Physical activity data
Distribution of MVPA
Distribution of MVPA and Total log activity count by time of day
Section session info
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 naniar_0.6.0 corrplot_0.84 gtsummary_1.3.6
## [5] Hmisc_4.4-2 Formula_1.2-4 survival_3.1-12 lattice_0.20-41
## [9] plotly_4.9.3 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.4
## [13] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.6
## [17] ggplot2_3.3.3 tidyverse_1.3.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-148 fs_1.5.0 usethis_2.0.1
## [4] lubridate_1.7.9.2 RColorBrewer_1.1-2 httr_1.4.2
## [7] rprojroot_2.0.2 tools_4.0.2 backports_1.2.1
## [10] R6_2.5.0 rpart_4.1-15 mgcv_1.8-31
## [13] DBI_1.1.1 lazyeval_0.2.2 colorspace_2.0-0
## [16] nnet_7.3-14 withr_2.4.1 tidyselect_1.1.0
## [19] compiler_4.0.2 cli_2.3.0 rvest_0.3.6
## [22] gt_0.2.2 htmlTable_2.1.0 xml2_1.3.2
## [25] labeling_0.4.2 bookdown_0.21 scales_1.1.1
## [28] checkmate_2.0.0 digest_0.6.27 foreign_0.8-80
## [31] rmarkdown_2.6 base64enc_0.1-3 jpeg_0.1-8.1
## [34] pkgconfig_2.0.3 htmltools_0.5.1.1 dbplyr_2.1.0
## [37] highr_0.8 htmlwidgets_1.5.3 rlang_0.4.10
## [40] readxl_1.3.1 rstudioapi_0.13 farver_2.0.3
## [43] generics_0.1.0 jsonlite_1.7.2 crosstalk_1.1.1
## [46] magrittr_2.0.1 Matrix_1.2-18 Rcpp_1.0.6
## [49] munsell_0.5.0 lifecycle_0.2.0 visdat_0.5.3
## [52] stringi_1.5.3 yaml_2.2.1 grid_4.0.2
## [55] crayon_1.4.1 haven_2.3.1 splines_4.0.2
## [58] hms_1.0.0 knitr_1.31 pillar_1.4.7
## [61] reprex_1.0.0 glue_1.4.2 evaluate_0.14
## [64] latticeExtra_0.6-29 data.table_1.13.6 broom.helpers_1.1.0
## [67] modelr_0.1.8 png_0.1-7 vctrs_0.3.6
## [70] rmdformats_1.0.1 cellranger_1.1.0 gtable_0.3.0
## [73] assertthat_0.2.1 xfun_0.20 broom_0.7.4
## [76] viridisLite_0.3.0 cluster_2.1.0 ellipsis_0.3.1